Editor's
Introduction 



No matter how powerful the under- 
lying hardware, most important to 
users is how that power translates to 
greater application performance and 
availability. Among the diverse topics
in this issue of the Journal are innovative
ways engineers have devised
to meet application performance and 
availability requirements, and new 
tools for applications developers. 

DIGITAL FX!32 is a unique soft- 
ware product that makes available 
hundreds of applications written 
for Intel machines to users of Alpha 
machines. Described by Ray Hookway
and Mark Herdeg, FX!32 combines
software emulation and advanced
binary translation techniques to enable
32-bit applications that run on Intel- 
based machines with Windows NT 
to also run on 64-bit RISC Alpha- 
based machines with Windows NT. 
The design provides both the perfor- 
mance benefits and the transparency 
of operation that the project engi- 
neering team sought for users. 

Also designed for the Windows 
environment is DIGITAL Visual 
Fortran, a tool for Fortran developers 
that combines technologies from 
DIGITAL and Microsoft Corpora- 
tion. Leo Treggiari reviews the tool's 
components, which include the 
Component Object Model (COM), 
Fortran 90, and Microsoft Developer 
Studio. He addresses the question of 
why developers need help accessing 
dynamic link libraries and servers 
based on COM, and then focuses on 
the newly created tool that provides 
this functionality, the Fortran Module 
Wizard. 



DIGITAL's shared-memory cluster
interconnect, MEMORY CHANNEL
2, delivers the high levels of computational
performance necessary to
support the largest technical and 
commercial applications. Marco Fillo 
and Rick Gillett assess experiences 
with the first implementation of 
MEMORY CHANNEL that led to 
such enhancements as the cross-bar 
design in this latest implementation. 
They conclude with performance 
data that demonstrate unparalleled 
performance in terms of latency and 
bandwidth compared with traditional 
interconnects. MEMORY CHANNEL 
2 provides latency of less than 2.2 
microseconds and bandwidth of 
1,000 megabytes per second in an
8-node cluster.

Data security has long been important
to system managers but not easily
achieved in distributed heterogeneous
systems. DIGITAL and BEA Systems
have integrated ObjectBroker middle- 
ware with the Distributed Computing 
Environment's Generic Security Service 
Application Programming Interface 
(GSS-API), as described here by John 
Parodi and Fred Burgher. The authors 
examine the choice of GSS-API for 
ObjectBroker and future directions 
in authentication software. 

Design decisions made in the development
of DIGITAL's StrongARM
microprocessor were driven by the
sometimes opposing requirements
of high performance and low power
consumption. Targeted for use in
handheld appliances usually powered
by conventional batteries, StrongARM
offers significantly higher performance
than comparable microprocessors: It
operates at 160 MHz, dissipating less
than 450 milliwatts. James Montanaro,
Rich Witek et al. step through the
decisions designers made to implement
the ARM V4 instruction set
from Advanced RISC Machines Ltd.

Upcoming in the next issue of 
the Journal are technical papers
about new AltaVista software and 
a new Windows NT personal work- 
station based on an Alpha 64-bit 
RISC processor. To view the results 
of a recent survey sent to Journal 
Web subscribers, see http://www.digital.com/info/dtj.

DIGITAL FX!32:
Combining Emulation 
and Binary Translation 




Raymond J. Hookway 
Mark A. Herdeg 



The DIGITAL FX!32 software product uniquely 
combines emulation and binary translation 
to enable any 32-bit application that executes 
on an Intel x86 microprocessor running the 
Windows NT 4.0 operating system to be installed 
and to execute on an Alpha microprocessor run- 
ning Windows NT 4.0. Benchmark tests indicate 
that after translation, x86 applications run as 
fast on a 500-MHz Alpha system with DIGITAL 
FX!32 software installed as on a 200-MHz Pentium
Pro system. The emulator and its associated run- 
time software provide transparent execution 
of applications written for x86-based platforms. 
The emulator produces profile data that is used 
by the translator and takes advantage of trans- 
lation results as they become available. The 
translator provides native Alpha code for the 
portions of an x86 application that have previ- 
ously been executed. A server manages the 
translation process for the user, making the 
process completely transparent. 



Three factors contribute to the success of a micro- 
processor: price, performance, and software availability. 
The DIGITAL FX!32 product addresses the third fac- 
tor, software availability, by making hundreds of new 
applications available on Alpha-based platforms run- 
ning the Windows NT operating system. DIGITAL 
FX!32 software combines emulation and binary
translation to provide fast, transparent execution of Intel
x86 applications on Alpha systems. 

Since its introduction in 1992, the Alpha micro- 
processor has been the fastest microprocessor 
available. A large number of native applications are 
available on Alpha systems, particularly those applica- 
tions that require a high-performance processor. With 
the introduction of DIGITAL FX!32 software, 32-bit 
programs that can be installed and executed on x86 
systems running the Windows NT 4.0 operating system
can also be installed and executed on Alpha systems
running Windows NT 4.0. Except for having to
specify that a program is an x86 application, installing
and running an application is the same on an Alpha 
system as on an x86 system. The performance of an 
x86 application running on a high-end Alpha system is 
similar to the performance of the same application 
running on a high-end x86 system. 

A number of systems have successfully used emulators
to run applications on platforms for which the
applications were not initially targeted [1, 2]. The major
drawback has been poor performance [2]. Several emulators
have used dynamic translation, translating small
segments of a program as it is executed, to achieve better
performance than that obtained by an interpreter
alone [3, 4]. Dynamic translation involves a basic trade-off
between the amount of time spent translating and the
resulting benefit of the translation. If an emulator spends
too much time on the translation and related processing,
the executing program will be unresponsive. This limits
the optimizations that can be performed by the emulator
using dynamic translation.

FX!32 overcomes the performance problem by not
doing any translation while the application is executing.
Rather, FX!32 captures an execution profile that is
later used by a binary translator [5] to translate into native
Alpha code those parts of the application that have
been executed. Since the translator runs in the background,
it can use computationally intensive algorithms
to improve the quality of the generated code.
To our knowledge, FX!32 is the first system to explore
this combination of emulation and binary translation.

In this paper, we describe how FX!32 works. We begin
with an overview and discuss each of the major components
in more detail. We then present some benchmark
test results and briefly describe several limitations of the
current version of DIGITAL FX!32 software.

Overview 

On Alpha systems, the Windows NT operating system 
uses an emulator to run 16-bit x86 applications. These 
applications can be installed and run in the same way as 
they are installed and run on x86 systems, but the exe- 
cution is slower. The emulator built into FX!32 pro- 
vides a similar capability for 32-bit x86 applications. 

Unlike the emulation software in the 16-bit environment,
FX!32 provides a binary translator that
translates 32-bit x86 applications into native Alpha
code. The translation is done in the background and 
requires no user interaction. Using background trans- 
lation allows the translator to perform optimizations 
that, in terms of computational resources, would be 
too expensive to accomplish while an application is 
running. An application translated by means of FX!32 
runs up to 10 times faster than the same application 
running under the emulator. 

DIGITAL FX!32 software consists of the following 
seven major components: 

1. The transparency agent, which provides for transparent
launching of 32-bit x86 applications.

2. The runtime, which loads x86 images and sets up
the run-time environment to execute them. As part
of loading an image, the runtime component jackets
imported application programming interface
(API) routines. Jackets are small code fragments
that allow the x86 code to call Alpha Windows NT
API routines.

3. The emulator, which runs an x86 application mak- 
ing use of translated code when it is available. 

4. The translator, which produces a translated image 
using profile information received from the emulator. 

5. The database, which stores execution profiles pro- 
duced by the emulator and used by the translator. 
Translated images are also stored in the database, 
along with configuration information. 

6. The server, which maintains the database and runs 
the translator as appropriate. 

7. The manager, which allows the user to control 
resources used by the DIGITAL FX!32 software. 

Figure 1 shows the relationships between these 
major components, each of which is discussed in more 
detail in the sections that follow. 

The Transparency Agent 

The transparency agent provides for transparent 
launching of 32-bit x86 applications. Launching an 
application on the Windows NT operating system 
always results in a call to the CreateProcess API routine. 
By hooking calls to this routine, the transparency agent 
can examine every image as it is about to be executed. 
If a call to CreateProcess specifies that an x86 image is 
to be executed, the transparency agent invokes the run- 
time component to execute the image. 
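
The paper does not say how the transparency agent recognizes an x86 image. One plausible way, shown here purely as an illustrative sketch, is to read the Machine field of the image's PE header. The constants are the standard PE/COFF machine codes; the names `image_machine`, `choose_launcher`, and `make_stub`, and the returned labels, are hypothetical and not part of FX!32.

```python
import struct

# Machine-type codes defined by the PE/COFF file header.
IMAGE_FILE_MACHINE_I386 = 0x014C
IMAGE_FILE_MACHINE_ALPHA = 0x0184


def image_machine(image: bytes) -> int:
    """Return the Machine field of a PE image, or raise ValueError."""
    if len(image) < 0x40 or image[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    # e_lfanew (offset 0x3C) holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", image, 0x3C)
    if image[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # The 16-bit Machine field immediately follows the signature.
    (machine,) = struct.unpack_from("<H", image, pe_offset + 4)
    return machine


def choose_launcher(image: bytes) -> str:
    """Hypothetical routing decision for a hooked process-creation call."""
    if image_machine(image) == IMAGE_FILE_MACHINE_I386:
        return "fx32-runtime"   # hand the image to the runtime and emulator
    return "native-loader"      # let Windows NT load the image as usual


def make_stub(machine: int) -> bytes:
    """Build a fake header with just enough fields for image_machine()."""
    dos = bytearray(0x40)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 0x40)
    return bytes(dos) + b"PE\0\0" + struct.pack("<H", machine)
```

Calling `choose_launcher(make_stub(IMAGE_FILE_MACHINE_I386))` selects the runtime path, while an Alpha stub falls through to the native loader.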

FX!32 inserts the transparency agent into the address
space of each process. A process that contains the transparency
agent is said to be enabled. Once a process is
enabled, any attempt to execute an x86 image causes
the runtime to be invoked to execute the process. The
agent is propagated through the system because each
attempt to create a process to run an Alpha image
results in that created process being enabled.

Figure 1
DIGITAL FX!32 System Components
(diagram labels: transparency agent; runtime and emulator;
database (registry); translated images; execution profiles;
binary translator; server)

Digital Technical Journal Vol. 9 No. 1 1997

By the time a user is logged on, FX!32 has enabled 
all the top-level processes, and any attempt to execute 
a 32-bit x86 application invokes the runtime compo- 
nent. The initial processes that are enabled are the 
Windows shell (explorer.exe), the service control man- 
ager (services.exe), and the remote procedure call 
server (rpcss.exe). When FX!32 is installed, the 
fx32strt.exe file is registered as the Windows shell. 
When a user logs on, fx32strt.exe runs and enables the
real Windows shell, explorer.exe. The FX!32 server
enables the service control manager when it starts,
usually when the system is booted. Currently, any service
process that is started by the service control manager
before the server is started is not enabled. (The
only exception is rpcss.exe, which is explicitly enabled
by the server.) We hope to alleviate this limitation in a
future version of the DIGITAL FX!32 software.

Processes are enabled using a technique described
by Jeffrey Richter in Chapter 16 of his book
Advanced Windows NT [6] to inject a copy of the transparency
agent into the process' address space.

The Runtime 

The transparency agent invokes the runtime whenever 
an attempt is made to execute an x86 image. The 
runtime loads the image into memory, sets up the run- 
time environment required by the emulator, and then 
calls the emulator to execute the image. 

The runtime replaces the Windows NT loader, 
which can only load Alpha images; the Windows NT 
loader returns an error reporting an image of the 
wrong architecture if it is invoked to load an x86
image. The runtime duplicates the functionality of the 
Windows NT loader, which includes relocating images 
that are not loaded at their preferred base address, set- 
ting up shared sections, and processing static thread 
local storage sections. 

The runtime registers each image it processes with 
the Windows NT operating system by inserting point- 
ers to that image into various lists that are used inter- 
nally by the system. Maintaining these lists allows the 
native Windows NT code to correctly implement API 
routines, such as LoadResource and GetModuleHandle,
which require access to images that have been loaded.
The registration also ensures that the DllMain functions
of the loaded dynamic link libraries (DLLs) are
called as appropriate. (The entry points of x86 DLLs 
are jacketed by the runtime.) 

Fortunately, the image lists that FX!32 must modify
are in the user's address space, and no modification of
the Windows NT operating system was required to
register images with the system. Unfortunately, the
structure of these lists is not part of the documented
Win32 interface, and using them creates a dependency
on the Windows NT version that is being run. FX!32
has dependencies on a number of undocumented features
of the Windows NT operating system. Although
the DIGITAL FX!32 product is more dependent on a
particular version of the operating system than a typical
layered application is, it is remarkable that the
implementation of FX!32 did not require any changes
to the Windows NT operating system.

The runtime also registers the image in the FX!32 
database. This database maintains information about 
x86 images that have been loaded, including the appli- 
cation that loaded the image, profile data that was pro- 
duced by the interpreter, and any translation of the 
image. The runtime accesses the database with a 
unique image identifier (ID), which the runtime 
obtains by hashing the image's header. Therefore, the 
image ID is determined by the content of the image, 
not by its location in the file system, and the information
that FX!32 associates with the image can be
accessed independently of the image's location on the
disk. For example, if an application is installed in one
directory and some of the images loaded by the application
are subsequently translated by FX!32, the translated
images will be located by FX!32 even if the
application is later installed in a different directory.
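
As a rough illustration of content-based image identification, the sketch below derives an ID from the leading bytes of an image and uses it as the database key. The paper does not name the hash function or say how much of the header is hashed; SHA-256, the 1024-byte `HEADER_BYTES` window, and the `ImageDatabase` class are all assumptions made for this example.

```python
import hashlib

HEADER_BYTES = 1024  # assumption: how much of the file counts as "header"


def image_id(image: bytes) -> str:
    """Derive an ID from image content, never from its file-system path."""
    return hashlib.sha256(image[:HEADER_BYTES]).hexdigest()[:16]


class ImageDatabase:
    """Hypothetical stand-in for the FX!32 database, keyed by image ID."""

    def __init__(self) -> None:
        self._records = {}

    def register(self, image: bytes, path: str) -> dict:
        # The same content always maps to the same record, whatever the path.
        record = self._records.setdefault(
            image_id(image),
            {"paths": set(), "profile": None, "translation": None},
        )
        record["paths"].add(path)
        return record
```

Registering the same bytes from two different directories returns the same record, so a translation attached after the first installation is found again after the second.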

When the runtime finds a translated image in the 
database, it loads this image along with the corre- 
sponding x86 image. Translated images are normal 
DLLs, loaded by the native LoadLibrary API routine. 
Translated images contain additional sections that 
store information required by the runtime to map x86 
routines to the corresponding Alpha code. 

The runtime duplicates the Windows NT loader 
function of binding an image's imports, using sym- 
bolic information in the image to locate the address of 
the imported routine or data. The runtime treats 
imports that refer to entries in Alpha images specially, 
however, by redirecting the imports to refer to the 
correct jacket entry in the FX!32 DLL, jacket.dll. 

The jacket routines in jacket.dll enable an x86 user 
program to call the native Alpha implementation of 
the Win32 API. These jacket routines are extremely 
important because they allow x86 applications to use 
high-performance code that has been tuned to the 
Alpha platform. Some x86 applications run faster on 
the Alpha platform than on the x86 platform, even 
without being translated, because of the large amount 
of time the applications spend in native DLLs. 

Each jacket contains an illegal x86 instruction that
serves as a signal to the interpreter that a change is to
be made to the Alpha environment. The interpreter
calls an Alpha jacket routine at a fixed offset from the
illegal x86 instruction. The basic operation of most
jacket routines is to move arguments from the x86
stack to the appropriate Alpha registers, as dictated by
the Alpha calling standard. Some jacket routines provide
special semantics for the native routine being
called, as required by FX!32. For example, the jacket
for the GetSystemDirectory routine returns the path
to the FX!32 directory rather than the path to the true
system directory so that x86 applications do not overwrite
native Alpha DLLs.
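
The argument-marshalling role of a jacket can be sketched in a few lines. This is a simplified model, not the FX!32 implementation: a Python call stands in for loading the Alpha argument registers and calling the native routine, and the callee-cleans-the-stack behavior assumes the stdcall convention used by most Win32 API routines. `X86Stack`, `make_jacket`, and `demo` are hypothetical names.

```python
import struct


class X86Stack:
    """Toy x86 stack: 32-bit little-endian slots in a bytearray."""

    def __init__(self, size: int = 256) -> None:
        self.mem = bytearray(size)
        self.esp = size  # the stack grows toward lower addresses

    def push(self, value: int) -> None:
        self.esp -= 4
        struct.pack_into("<I", self.mem, self.esp, value & 0xFFFFFFFF)

    def read(self, offset: int) -> int:
        return struct.unpack_from("<I", self.mem, self.esp + offset)[0]


def make_jacket(native_routine, arg_count: int):
    """Wrap a native routine so it can be called with x86 stack arguments."""

    def jacket(stack: X86Stack) -> int:
        # [esp] holds the x86 return address; arguments start at [esp + 4].
        args = [stack.read(4 + 4 * i) for i in range(arg_count)]
        # Stand-in for placing args in registers and calling native code.
        result = native_routine(*args)
        # Assumed stdcall: drop the return address and the arguments.
        stack.esp += 4 + 4 * arg_count
        return result & 0xFFFFFFFF  # value handed back in the x86 EAX

    return jacket


def demo() -> tuple:
    stack = X86Stack()
    stack.push(7)          # second argument (pushed first)
    stack.push(5)          # first argument
    stack.push(0x401000)   # return address pushed by the x86 call
    jacket = make_jacket(lambda a, b: a * 10 + b, 2)
    return jacket(stack), stack.esp
```

A jacket with special semantics, such as the one described for GetSystemDirectory, would simply substitute its own result instead of forwarding the native one.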

For an x86 application to run under FX!32, every 
image it loads must be either an x86 image or an Alpha 
image for which jackets exist. Therefore, FX!32 pro- 
vides jackets for all the DLLs that implement the 
Win32 interface and for many redistributable DLLs. 
FX!32 currently provides jackets for more than 50 
native Alpha DLLs, which has enabled the FX!32 development
team to run almost all the commercial applications
tested. Each new release of DIGITAL FX!32
software provides additional jackets, and the developers
intend to jacket new interfaces as they are released.

The Emulator 

The fundamental job of the emulator is to run x86 
applications before they are translated. The first time 
an x86 image executes under FX!32, the image is executed
by the emulator.

The emulator also serves as a backup for translated
code. Because it is not possible to statically determine
all the code that can ever be executed by an application
(especially for applications that generate code on-the-fly),
the emulator is always present to execute such
untranslated x86 application code. Previous binary
translators built by DIGITAL also depended on the
presence of an emulator in this role [5]. Emulator performance
is more of an issue for FX!32 because, unlike
those earlier binary translators, all application code is
interpreted when the x86 application is first run.

The emulator is an Alpha assembly language program 
that interprets the subset of x86 instructions that can be 
executed by a Win32 application. While an x86 applica- 
tion is running, the x86 processor state is kept partially 
in Alpha registers and partially in a per-thread data 
structure called the CONTEXT. The x86 integer registers
are permanently mapped to Alpha registers, and
Alpha registers store the state of the x86 condition 
codes. While the emulator is running, a dedicated Alpha 
register points to the CONTEXT. The CONTEXT 
stores the x86 per-thread processor context and any part 
of the x86 processor state that must be maintained 
across calls to other parts of the system, for example, 
calls to Alpha API routines. 

Pipelined Dispatch 

The structure of the emulator is a classic fetch-and-evaluate
loop. The emulator dispatches on the first
two bytes of each instruction, performing the lookup
in a table of 64K entries. Each entry contains the
address of the routine to execute to interpret an
instruction and the length of the instruction.

The structure of the dispatch loop has been care- 
fully crafted to make efficient use of 64-bit Alpha reg- 
isters and to efficiently schedule the execution of code 
in the loop. Software pipelining is used to overlap the 
fetch and dispatch table lookup for the next instruc- 
tion with the execution of the current instruction. 
At the top of the loop, at least eight bytes, starting at 
the address of the current instruction, are in Alpha 
registers. Length information from the dispatch table 
determines the first two bytes of the next instruction, 
allowing the dispatch table lookup to be overlapped 
with the execution of the current instruction. A fetch 
of additional bytes from the instruction stream is also 
initiated. Finally, the loop dispatches to the routine 
whose address was obtained from the table on the pre- 
vious iteration of the loop. 

The individual routines have been factored by using 
subroutines and coroutines to perform operations like 
operand fetching, making them as small as possible. As 
a result, the emulator code required to execute the 
most frequently executed x86 instructions fits in the 
first-level cache. 
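
The dispatch scheme can be approximated in a high-level sketch. The code below is a toy model only: it handles four real single-purpose x86 opcodes (NOP, INC EAX, MOV EAX imm32, HLT), ignores control transfers, and uses Python callables where the emulator uses addresses of assembly routines. The names `TABLE`, `install`, and `run` are hypothetical. The point it illustrates is that the table entry for the next instruction is looked up before the current handler runs, using the length stored in the current entry.

```python
def op_nop(state, code, pc):
    pass


def op_inc_eax(state, code, pc):
    state["eax"] = (state["eax"] + 1) & 0xFFFFFFFF


def op_mov_eax_imm32(state, code, pc):
    state["eax"] = int.from_bytes(code[pc + 1:pc + 5], "little")


def op_hlt(state, code, pc):
    state["halted"] = True


def op_illegal(state, code, pc):
    raise ValueError(f"unimplemented opcode at offset {pc}")


# 64K entries indexed by the first two instruction bytes (little-endian).
TABLE = [(op_illegal, 1)] * 65536


def install(opcode, handler, length):
    # A one-byte opcode owns every entry whose low byte matches it.
    for second in range(256):
        TABLE[opcode | (second << 8)] = (handler, length)


install(0x90, op_nop, 1)
install(0x40, op_inc_eax, 1)
install(0xB8, op_mov_eax_imm32, 5)
install(0xF4, op_hlt, 1)


def run(code: bytes) -> int:
    state = {"eax": 0, "halted": False}
    padded = code + bytes(8)  # padding keeps the two-byte fetch in bounds
    pc = 0
    entry = TABLE[padded[0] | (padded[1] << 8)]
    while not state["halted"]:
        handler, length = entry
        next_pc = pc + length
        # Pipelined step: look up the next entry before executing this one.
        entry = TABLE[padded[next_pc] | (padded[next_pc + 1] << 8)]
        handler(state, padded, pc)
        pc = next_pc
    return state["eax"]
```

For example, `run(bytes([0xB8, 0x2A, 0, 0, 0, 0x40, 0x90, 0xF4]))` loads 42 into EAX, increments it, and halts with 43.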

Condition Code Evaluation 

Condition codes are generated by the execution of 
many of the x86 instructions. We have observed that 
condition codes are frequently set and relatively 
infrequently examined. The emulator takes advan- 
tage of this by evaluating the condition codes only 
when they are used, that is, by using a "lazy evaluation"
technique. The execution of a typical instruction
saves only enough state to allow the evaluation
of condition codes, if required, at a later time. This 
takes much less effort than initially evaluating the 
condition codes. The additional advantage in defer- 
ring the evaluation is that only the condition codes 
that are used need to be generated. For example, the 
overflow condition code may never be computed if 
only the zero flag is used.
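
A minimal sketch of lazy condition code evaluation follows, assuming 32-bit add and subtract only. Each arithmetic operation records its operands and result; a flag is computed only when something asks for it. The `LazyFlags` class and the `add32` and `sub32` helpers are hypothetical names, and the state the real emulator saves is not described in the paper.

```python
MASK32 = 0xFFFFFFFF
SIGN32 = 0x80000000


class LazyFlags:
    """Remember the last flag-setting operation; derive flags on demand."""

    def __init__(self) -> None:
        self.op, self.a, self.b, self.result = "add", 0, 0, 0

    def record(self, op: str, a: int, b: int, result: int) -> None:
        # Cheap: four stores, no flag computation.
        self.op, self.a, self.b, self.result = op, a, b, result

    def zero(self) -> bool:
        return self.result == 0

    def carry(self) -> bool:
        if self.op == "add":
            return self.result < self.a   # unsigned wraparound
        return self.a < self.b            # borrow on subtract

    def overflow(self) -> bool:
        if self.op == "add":
            # Operands share a sign that the result does not.
            return bool(~(self.a ^ self.b) & (self.a ^ self.result) & SIGN32)
        # Subtract: operands differ in sign and the result flips a's sign.
        return bool((self.a ^ self.b) & (self.a ^ self.result) & SIGN32)


def add32(flags: LazyFlags, a: int, b: int) -> int:
    result = (a + b) & MASK32
    flags.record("add", a, b, result)
    return result


def sub32(flags: LazyFlags, a: int, b: int) -> int:
    result = (a - b) & MASK32
    flags.record("sub", a, b, result)
    return result
```

If a program executes several additions and then tests only the zero flag, the carry and overflow computations above never run, which is the saving the text describes.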

Floating-point Instruction Emulation 

The 80-bit x86 floating-point registers are modeled
by a stack of 64-bit memory locations that contain 
floating-point values. The decision to use 64-bit inter- 
mediate values, rather than to faithfully replicate the 
80-bit model, was based on the need to achieve good 
performance when executing x86 floating-point code 
on the Alpha processor. This decision was supported 
by the fact that the Windows NT operating system also
uses a 64-bit floating-point model. Although this is an 
approximation, our experience to date has shown that 
this was a good compromise. Very few applications 
rely on the full precision provided by the x86 floating-point
unit's (FPU's) 80-bit registers.

The emulator also implements a somewhat simplified
model of the x86 FPU's register file. Most instructions
use the x86 FPU register file as a traditional
operand stack; however, several instructions can create 
a register file state that is not strictly a stack by freeing 
registers in the middle of the stack, by moving the 
stack pointer without pushing or popping, or by ini- 
tializing the register file in a way that breaks the stack 
model. Modeling the full complexity of the x86 FPU 
register file would be extremely expensive, and experi- 
ence has shown that almost all programs use the regis- 
ter file strictly as a stack. The current version of the 
emulator takes advantage of this. We are investigating 
ways to model the floating-point registers in a way that 
maintains good performance but does not depend on 
their being treated as a stack. 

Generation of Profiles 

While it is interpreting an x86 program, the emulator 
generates profile data for use by the translator. The 
profile data includes the following information:

■ Addresses that are the targets of call instructions 

■ (Source address, target address) pairs for indirect 
control transfers 

■ Addresses of instructions that make unaligned ref- 
erences to memory 

The translator uses this information to generate 
routines, that is, units of translation that approximate 
a source code routine. The emulator generates profile 
data by inserting values in a hash table whenever a rel- 
evant instruction is interpreted. For example, as part of 
interpreting the call instruction, the emulator makes 
an entry in a hash table that records the target of the 
call. When an image is unloaded (either as a result of a 
call on the FreeLibrary routine or when the applica- 
tion exits), the runtime processes the hash table to 
produce a profile file for that image. This profile is 
processed by the server and can result in the server 
invoking the translator to create a new translation of 
the image. 

To detect available translated code, the emulator 
uses the same hash table that it employs to gather the 
profile data. The x86 addresses for which there are 
translated routines and the address of the correspond- 
ing translated code are entered into the hash table by 
the runtime when it loads an x86 image that has been 
translated. When a call instruction is interpreted, the 
emulator looks up the target address. If a correspond- 
ing translated address exists, the emulator transfers 
control to that address. 
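
The dual use of the hash table can be sketched as follows. This is an assumption-laden illustration: a Python dict stands in for the emulator's hash table, only call targets are modeled, and the names `ProfileTable`, `add_translation`, `on_call`, and `new_call_targets` are hypothetical.

```python
class ProfileTable:
    """One table for both profile collection and translated-code lookup."""

    def __init__(self) -> None:
        # x86 call target -> translated Alpha address, or None if the
        # target has only been observed and not yet translated.
        self.entries = {}

    def add_translation(self, x86_addr: int, alpha_addr: int) -> None:
        # Done by the runtime when it loads a translated image.
        self.entries[x86_addr] = alpha_addr

    def on_call(self, target: int):
        """Called while interpreting a call instruction."""
        translated = self.entries.get(target)
        if translated is None:
            # No translation yet: record the target as profile data.
            self.entries.setdefault(target, None)
        return translated  # non-None means "jump to translated code"

    def new_call_targets(self) -> list:
        """Targets seen during this run that have no translation."""
        return sorted(a for a, t in self.entries.items() if t is None)
```

When the image is unloaded, something like `new_call_targets()` would feed the profile file that the server later hands to the translator.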

The Translator 

The server invokes the translator to translate x86
images for which a profile exists in the database. The
translator uses the profile to produce a translated
image. On subsequent executions of the image, the
translated code is used, substantially speeding up the
application.

Structure and Order of Operations 

The translator has eight major components (or phases):
the regionizer, build, the register mangler, the condition
code mangler, improve, the code selector, the
scheduler, and the assembler. (An additional phase
that performs various peephole optimizations is disabled
in the DIGITAL FX!32 V1.0 translator.) The
major components function as follows:

1. The Regionizer — The regionizer uses data in the 
profile to divide the source image code into rou- 
tines, which are described in the section Generation 
of Profiles. Each call target in the profile is used to 
generate an entry to a routine. The regionizer rep- 
resents routines as a collection of regions. Each 
region is a range of contiguous addresses, which 
contains instructions that can be reached from the 
entry address of the routine. Unlike basic blocks, 
regions can have multiple entry points. The small- 
est collection of regions that contain all the instruc- 
tions that can be reached from the routine entry is 
used to represent the routine. Many routines have a 
single region. This representation was chosen to 
efficiently describe the division of the source image 
into units of translation. 

The regionizer builds routines by following the 
control flow of the source image. When an indirect 
jump instruction is encountered while following 
the control flow, the possible targets of the instruc- 
tion are obtained from the profile. Without this 
profile information, it would be very difficult to
reliably identify these targets, and indirect jumps 
would have to be treated as returns from the rou- 
tine. The profile information makes it possible to 
reliably generate a more complete representation of 
routines with correct control flow. 

After the regionizer runs, each of the other major 
components is run in sequence for each routine. 

2. Build — Build reparses the x86 instructions in the
routine to create an internal representation (IR) of 
the routine for use by the subsequent components. 
The IR is a graph of basic blocks and is similar to the 
IR used by many optimizing compilers. 

3. The Register Mangler — The initial IR is a straightforward
representation of the source x86 code.
This representation ignores the overlap of the x86
registers; the IR treats each occurrence of EAX,
AX, AH, and AL as a separate register. The register
mangler adds insert and extract operations as necessary
to represent the actual semantics of the x86
registers.

4. The Condition Code Mangler — The effect of x86
instructions on condition codes is represented
implicitly in the initial IR. The condition code mangler
adds instructions to explicitly generate condition
codes. Since the condition code mangler
understands the control flow of the entire routine,
it knows when condition codes are live and only
adds code to generate condition codes when they
are used later in the routine.

5. Improve — Improve performs several transforma- 
tions that produce code more suited to the Alpha 
architecture. In the initial IR, each push and pop 
instruction is explicitly represented as a decrement/ 
increment of the x86 stack pointer, accompanied by 
a store/load. Improve collects all the manipulation 
of the x86 stack pointer into a single decrement at 
the beginning of a basic block and a single incre- 
ment at the end of that block. Improve also uses 
simple value numbering and analysis of memory 
references to try to eliminate loads and stores to 
both the x86 stack and the floating-point stack and 
to perform constant folding. Although Improve 
performs only relatively simple optimizations on a 
single basic block, we have found it to be quite 
effective in improving the quality of the code that is 
generated.

6. The Code Selector — The code selector transforms 
the IR from a representation that contains mostly 
x86 instructions to one that contains only Alpha 
instructions. This transformation is done instruction 
by instruction, with each x86 instruction being 
replaced by a sequence of Alpha instructions that 
produce the same effect. The implementation of the 
code selector is based on the TWIG code generator [7].
Although the code selector is capable of dealing 
with much more complicated patterns of instruc- 
tions, this capability is not currently used. 

7. The Scheduler — After the code selector is run, all 
the instructions in the IR are Alpha instructions. 
The scheduler reorders the instructions within a 
basic block to minimize the cycle count for the tar- 
get processor. 

8. The Assembler — The assembler builds the output 
translated image. 
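
To make the register mangler's task concrete, the sketch below shows what the insert and extract operations must compute for the overlapping x86 registers EAX, AX, AH, and AL. It models the semantics only; the form these operations take in the translator's IR is not described in the paper, and the `FIELDS` table and function names are hypothetical.

```python
# (bit offset, width) of each sub-register within the 32-bit EAX.
FIELDS = {"EAX": (0, 32), "AX": (0, 16), "AL": (0, 8), "AH": (8, 8)}


def extract(eax: int, name: str) -> int:
    """Read a sub-register out of the full 32-bit register value."""
    shift, width = FIELDS[name]
    return (eax >> shift) & ((1 << width) - 1)


def insert(eax: int, name: str, value: int) -> int:
    """Write a sub-register, leaving the other bits of EAX unchanged."""
    shift, width = FIELDS[name]
    mask = ((1 << width) - 1) << shift
    return (eax & ~mask & 0xFFFFFFFF) | ((value << shift) & mask)
```

A write to AH followed by a read of AX, for example, becomes `extract(insert(eax, "AH", v), "AX")`, which is why the two cannot be treated as independent registers.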

Use of Profile Data

The regionizer is the only component of the current 
translator that uses the control flow information in the 
profile. The regionizer uses the profile to determine 
which parts of the source image are translated. Future
versions of the translator will use the profile to perform
path-directed optimizations and to place code so as to 
reduce cache misses. Those changes will improve the 
performance of translated code. 



Retranslation of an image is triggered by growth in
the size of the profile. Because profile data is generated
only when the emulator executes previously untranslated
parts of the source image, an increase in the size
of the profile indicates that new parts of the program 
have been executed. Retranslating with the new pro- 
file will cause these additional parts of the image to be 
translated. 
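
The retranslation trigger reduces to a simple test, sketched here under the assumption that a profile can be modeled as a set of entries and that the server merges each run's profile into the stored one. The function name `merge_profile` is hypothetical.

```python
def merge_profile(stored: frozenset, fresh: frozenset) -> tuple:
    """Merge a run's profile into the stored one.

    Returns the merged profile and whether it grew, which is the
    condition the text gives for retranslating the image.
    """
    merged = stored | fresh
    return merged, len(merged) > len(stored)
```

A run that exercises only already-profiled code leaves the size unchanged, so no retranslation is requested.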

Alignment Issues 

On an Alpha system, references to memory locations 
that are not naturally aligned result in exceptions that 
are handled by the Windows NT kernel. Alignment 
exceptions can be avoided by using unaligned code 
sequences that use the LDQ_U and STQ_U instruc- 
tions. Unaligned code sequences are slower than 
aligned sequences for accessing locations that are nat- 
urally aligned but much faster for accessing locations 
that are not naturally aligned. Native Alpha compilers 
always try to generate unaligned code sequences when 
referencing unaligned data to avoid the expense of 
dealing with alignment exceptions. 

When generating the code for an instruction that 
references memory, the code selector must determine 
whether to use an aligned sequence or an unaligned 
sequence. To make the determination, the code selec- 
tor needs to know the alignment of the address being 
referenced. In general, this cannot be determined by 
static analysis of the x86 code. To solve the problem, 
the code selector uses information in the profile about 
the alignment of memory addresses. The profile con- 
tains the address of every instruction that made an 
unaligned reference to memory. The code selector 
generates unaligned sequences for those instructions 
and aligned sequences for all other memory references.
Although this code generation process is effective most 
of the time, some programs exhibit different memory 
reference behavior on successive runs. For those pro- 
grams, alignment exceptions can still occur. 
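
The profile-driven choice can be sketched as a set lookup. This is an illustrative model only: the class, the function, and the addresses are hypothetical, and the real code selector emits Alpha instruction sequences rather than labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryInstruction:
    address: int    # address of the x86 instruction in the source image
    mnemonic: str   # for example "mov"


def select_sequence(instr: MemoryInstruction, unaligned_in_profile: set) -> str:
    """Pick the kind of Alpha sequence for one memory-referencing instruction.

    The profile is modeled as the set of instruction addresses that made an
    unaligned reference while running under the emulator.
    """
    if instr.address in unaligned_in_profile:
        return "unaligned"   # LDQ_U/STQ_U-style sequence, safe for any address
    return "aligned"         # faster, but traps if the data is misaligned


profile = {0x401010}
instructions = [MemoryInstruction(0x401000, "mov"),
                MemoryInstruction(0x401010, "mov")]
choices = [select_sequence(i, profile) for i in instructions]
```

An instruction that was aligned in every profiled run but is unaligned in a later run still gets the aligned sequence, which is the residual case the text describes.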

Shadow Stack 

Translating return instructions presented particular 
problems for the translator. The translation of a call 
instruction saves the x86 return address on the x86 
stack and then calls the translated code for the routine. 
After the translated call, the x86 return address is on 
the x86 stack and the corresponding native return 
address is in an Alpha register. This maintains the x86
stack in the expected x86 state. One way to translate a
return instruction would be to use the x86 return 
address to look up a corresponding Alpha address; 
however, it is desirable to avoid the expense of a hash 
table lookup on every return. In the usual case, the 
return address is not changed by the routine and the 
translated code can pop the x86 stack and perform a 
native return by using the native return address. Two
problems must be solved, though. First, some mechanism
is needed to determine if the x86 return address
has been modified. Second, a location is needed to 
save the native return address. Both problems are 
solved by using the shadow stack. 

The shadow stack resides at the top of the native 
Alpha stack and is maintained by the translated code 
(with help from the emulator). A shadow stack frame is 
created for each call of a translated routine. When one 
translated routine calls another, the calling routine saves
the x86 return address and the current x86 stack pointer
in its shadow stack frame. The called routine then saves 
the native return address in the calling routine's shadow 
stack frame. On return, the called routine expects to 
find the x86 return address and the current x86 stack 
pointer in the calling routine's shadow stack frame. In 
this case, the called routine is returning to the environ- 
ment that the calling routine expected and performs a 
native return. If the value of either the return address 
or the stack pointer has changed from the value 
expected by the calling routine, the called routine 
returns to the emulator.

In a similar manner, the emulator uses the informa- 
tion in the shadow stack to determine when it can 
return to translated code. A number of conditions 
can cause translated code to reenter the emulator. For 
example, the emulator is entered if the target of a 
translated indirect jump instruction is not known at 
translation time. Having the emulator return to trans- 
lated code on a return instruction minimizes the 
amount of time that is spent in the emulator; however, 
the emulator can only return to the translated code if it
knows that it has a valid return address. The shadow
stack provides a mechanism to perform that validation. 
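
The validation performed on return can be modeled as follows. This is a simplified sketch under stated assumptions: each shadow frame is collapsed into a single record (the text has the caller and the callee each filling in part of it), and the class names, method names, and addresses are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ShadowFrame:
    x86_return_address: int      # saved by the calling routine
    x86_stack_pointer: int       # saved by the calling routine
    native_return_address: int   # saved by the called routine


class ShadowStack:
    def __init__(self):
        self._frames = []

    def on_call(self, x86_ret: int, x86_sp: int, native_ret: int) -> None:
        self._frames.append(ShadowFrame(x86_ret, x86_sp, native_ret))

    def on_return(self, x86_ret_on_stack: int, x86_sp: int):
        """Return ("native", address) if the saved state still matches,
        otherwise ("emulator", x86 return address)."""
        frame = self._frames[-1] if self._frames else None
        if (frame is not None
                and frame.x86_return_address == x86_ret_on_stack
                and frame.x86_stack_pointer == x86_sp):
            self._frames.pop()
            return ("native", frame.native_return_address)
        # Return address or stack pointer changed: let the emulator sort it out.
        return ("emulator", x86_ret_on_stack)


shadow = ShadowStack()
shadow.on_call(0x401005, 0x12FF00, 0x70000010)
normal = shadow.on_return(0x401005, 0x12FF00)      # nothing was modified

shadow.on_call(0x401005, 0x12FF00, 0x70000010)
modified = shadow.on_return(0x402000, 0x12FF00)    # routine rewrote its return address
```

The common case costs two comparisons instead of a hash table lookup, which is the saving the shadow stack is meant to provide.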

The Database 

The database consists of two parts. As described for 
the runtime, the first part of the database is a directory
tree that contains profile files, translator log files, and 
translated images. The second part of the database is 
kept in the registry and consists of information about 
x86 applications and images that the DIGITAL FX!32
software has run on the system, together with config- 
uration information. The configuration information 
includes the maximum amount of disk space that can 
be used by FX!32, the maximum number of images 
that can be stored in the database, the default transla- 
tion options, the work list that the server uses to 
schedule translations, and the DatabaseDirectoryList. 
The DatabaseDirectoryList is a list of paths to addi- 
tional databases that are to be searched for image pro- 
files and translation results when the image is first 
executed. Directories on this list can be used to access 
information about the image from other machines on 
a network, making available to a user translations per- 
formed on another, perhaps more powerful, machine. 
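
A search over the DatabaseDirectoryList might look like the sketch below. The on-disk layout (one subdirectory per image identifier) and the choice to search the local database first are assumptions for illustration; the paper does not specify either.

```python
from pathlib import Path
from typing import Iterable, Optional


def find_image_results(image_id: str,
                       local_db: Path,
                       database_directory_list: Iterable[Path]) -> Optional[Path]:
    """Return the first directory holding profiles or translations for image_id.

    Hypothetical layout: each database root contains one subdirectory per image.
    """
    for root in [local_db, *database_directory_list]:
        candidate = root / image_id
        if candidate.is_dir():
            return candidate
    return None
```

With a network share on the list, a desktop machine could pick up a translation that a faster server already produced instead of translating the image itself.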



The Server 

The server is a Windows NT service that normally 
starts whenever the system is rebooted. The server 
automatically runs the translator when appropriate, 
thus making the translation process completely trans- 
parent to the user. The server also maintains the data- 
base to control DIGITAL FX!32 resource usage. 

The Manager 

Usually the operation of DIGITAL FX!32 software is 
completely transparent to the user. Like any other pro- 
gram, though, FX!32 consumes system resources and a 
user must be able to control that resource usage. One 
of the roles of the manager is to provide a user interface
to the configuration information kept in the database. 

Figure 2 shows the manager window. The upper 
pane contains information about the various applica- 
tions that have been run on the system: the total 
amount of disk space being used for profiles and trans- 
lations of images loaded by the application, the num- 
ber of times the application has been run, the date 
when it was last run, and the optimizer (translator) 
status. The lower pane contains information about 
the images that have been loaded by the highlighted 
application in the upper pane: the total amount of disk 
space used to store the profile and translation of the 
image, the number of times the image has been 
loaded, the date on which it was last loaded, and the 
status of the last translation of the image. 

By interacting with the manager, the user can con- 
trol various aspects of FX!32 operation, such as the 
maximum amount of disk space to use, which information
to retain in the database, and when the translator
should run. 

Results 

The DIGITAL FX!32 development team had two primary
goals for the software: (1) to achieve transparent
execution of 32-bit x86 applications and (2) to yield
approximately the same performance as a high-end
x86 platform when running applications on a high-performance
Alpha system. The DIGITAL FX!32
product meets both goals.

Transparency is provided by the transparency agent 
and a run-time environment that can load and execute 
an x86 application without a translation step. Appli- 
cations can be launched and executed on an Alpha 
system that is running FX!32 just as they can on an
x86 system. We have performed extensive testing 
of more than 75 applications that run using FX!32, 
including major commercial applications such as 
Microsoft Office 95, Visual Basic 4.0, Photoshop 4.0, 
and CorelDRAW 6.0.

Figure 2

The DIGITAL FX!32 Manager 



DIGITAL FX!32 software also met its performance
goal. Figure 3 shows the relative performance on
BYTE Magazine's BYTEmark benchmark of a 200-megahertz
(MHz) Pentium Pro system and a 500-MHz
Alpha system running FX!32. For this
benchmark, the Alpha system provides about the
same performance as the 200-MHz Pentium Pro
system. Figure 3 also shows that the Alpha native
version of the benchmark runs twice as fast as the
Pentium Pro version.

Of course, no single benchmark characterizes the
performance of a system. Even so, when running
translated x86 applications, we have consistently measured
performance on a 500-MHz Alpha system to be
in the range between that of a 200-MHz Pentium system
and that of a 200-MHz Pentium Pro system. For
some applications, performance can exceed that of a
Pentium Pro system.

Figure 3

DIGITAL FX!32 Performance on the BYTE Benchmark

The initial version of the DIGITAL FX!32 software
has some limitations. FX!32 executes only application
code; it does not execute drivers. Consequently, native
drivers are required for any peripheral that is installed
on an Alpha system. Also, as described in the
Transparency Agent section, FX!32 does not provide
complete support for x86 services. Further, FX!32
does not support the Windows NT Debug API.
Supporting that interface would require the capability
to rematerialize the x86 state after every x86 instruction,
thus severely limiting optimizations that the
translator could perform. Optimizing compilers make
a similar trade-off by restricting optimization when
debugging information is required. Since FX!32 does
not support the Debug interface, applications that
require it do not run under FX!32. Those applications
are mostly x86 development environments, and it
probably makes more sense to run them on an x86
system. The limitations described are not serious, and
most x86 applications that execute on an x86 processor
that is running the Windows NT operating system
also execute on an Alpha system running Windows NT
and DIGITAL FX!32 software.

Summary 

DIGITAL FX!32 software provides fast, transparent 
execution of 32-bit x86 applications on Alpha systems
running the Windows NT operating system. This is 
accomplished using a unique combination of emula- 
tion and binary translation. The emulator runs an 
application, interprets the code, and generates profile 
information. For subsequent executions, the translator 
uses the profile data to produce translated images that 
contain optimized native Alpha code. An application 
translated by means of DIGITAL FX!32 software runs 
up to 10 times faster than the same application running
under the emulator alone. Moreover, the translation
takes place in the background and is therefore
transparent to the user.

Leo P. Treggiari



Development of the 
Fortran Module Wizard 
within DIGITAL Visual 
Fortran 



The Fortran Module Wizard is one of the tools 
in DIGITAL Visual Fortran, a DIGITAL product for 
the Fortran development environment. Visual 
Fortran consists of the DIGITAL Fortran 90 compiler 
and run-time libraries and the Microsoft Developer 
Studio. Together, these technologies provide a 
rich set of tools for the Fortran developer who 
is using the Windows NT and Windows 95 sys- 
tems. The Fortran Module Wizard generates 
complete Fortran source code, allowing Fortran 
applications to invoke routines in a dynamic link 
library, methods of an Automation object, and 
member functions of a Component Object 
Model (COM) object. 



DIGITAL Visual Fortran is an integrated development
environment for Fortran applications. 1 It is supported on
the Windows NT version 4.0 operating system on both
Alpha and Intel hardware and on the Windows 95 system.
DIGITAL Visual Fortran is a combination of technologies
from DIGITAL and Microsoft Corporation.
The DIGITAL-supplied compiler and run-time libraries
support the DIGITAL Fortran 90 language. 2 DIGITAL
Fortran 90 conforms to American National Standard
Fortran 90 (ANSI X3.198-1992) and provides many
extensions to the Fortran 90 standard. The Microsoft-supplied
integrated development environment is the
Microsoft Developer Studio, which is also used by 
Microsoft Visual C++, Microsoft Visual J++ (for Java), 
other Microsoft tools, and other companies' develop- 
ment tools. Developer Studio includes a text editor, 
resource editors, project build facilities, an incremental 
linker, a source code browser, an integrated debugger, 
and a profiler. The operation of all these tools is con- 
trolled from a single application. Figure 1 shows an 
example of Microsoft Developer Studio from which two 
Fortran source files are being edited. DIGITAL adds a 
number of Fortran-specific tools to the environment, 
one of which is the Fortran Module Wizard. 

Design of the Fortran Module Wizard 

DIGITAL designed the Fortran Module Wizard to 
help Fortran developers working in the application-rich 
Windows environment. The Fortran Module Wizard 
supports access to dynamic link libraries (DLLs) and 
servers based upon Microsoft's Component Object 
Model (COM). This support allows Fortran developers 
to use the popular mechanisms that make functionality 
(services) available to other software (clients). 

Traditionally, Microsoft and others have provided 
system interfaces and reusable libraries of code as 
DLLs. A DLL is a file containing functions that can be 
called by programs and other DLLs. The role of DLLs 
on a Windows system is very similar to that of shareable
images on the OpenVMS operating system and
shared libraries on the UNIX system. Today, DLLs are 
still the primary mechanism for accessing system interfaces
on Windows.

Figure 1

Microsoft Developer Studio, Two Fortran Source Files Being Edited



When Microsoft introduced OLE version 1, the 
name OLE was an acronym for object linking and 
embedding. OLE version 1 enabled compound docu- 
ments by allowing a document to link to, or embed 
data from, another document. In 1993, Microsoft 
introduced COM as the base architecture of OLE 
version 2. 3 COM is an extensible architecture that pro- 
vides mechanisms for creating and using software com- 
ponents. A software component consists of reusable 
pieces of code and data in binary form that can be 
plugged into other software components from other 
vendors with relatively little effort. Like DLLs, COM
allows a software developer to provide a set of services 
to multiple clients. In addition, COM has the advan- 
tage of allowing the services to reside in another 
process and on another machine. (Distributed COM 
[DCOM] allows objects to be created and used on 
remote machines.) COM also contains features that aid 
in the deployment and evolution of the services.
Microsoft has extended its languages and tools to aid 
software developers in the creation of clients and 
servers based upon COM (hereafter referred to as 
clients and servers in this paper). 



Why does a Fortran developer need help accessing 
services in DLLs and servers? Calling code that is written
in another programming language is, in general,
difficult. There are complex issues around calling stan- 
dards and data type representations. If a mistake is 
made in manually translating a function signature 
from one language into another, today's program- 
ming environments are of little help. The application 
can fail at a point in the code, for example in the rou- 
tine prolog, which does little to suggest the cause of 
the problem. Often, solving these problems requires 
understanding the intricacies of calling standards and 
single stepping through assembly code. Calling the 
components in a server also requires understanding 
and properly using a number of COM programming 
interfaces. 

The Fortran Module Wizard deals with these difficulties.
It reads a description of a service, which the service
provider created, and generates Fortran source
code. This automatically generated code makes calling
these services as easy as calling another Fortran function
or subroutine.

Enabling Technologies

Components of COM, Fortran 90, and the Microsoft 
Developer Studio enable the functionality of the Fortran
Module Wizard. This section gives an overview of these 
technologies. 

COM Technologies 

As mentioned earlier, COM provides mechanisms for 
creating reusable software components. This paper 
attempts to explain only those parts of COM, and some 
technologies based on COM, necessary for the reader 
to understand the use of server functionality from 
code generated by the Fortran Module Wizard. COM, 
OLE, and ActiveX, of course, contain many more
mechanisms. A number of the references listed at the
end of this paper are good sources of further reading. 4-7
Much of the description of COM in the following
section is taken from the Component Object
Model Specification. 8

COM Objects COM is an object-based programming
model designed to promote software interoperability.
In other words, COM allows two or more applications
or components to easily cooperate with one another,
even if they were written by different vendors at different
times, in different programming languages, or if
they are running on different machines running different
operating systems. COM defines a completely standardized
mechanism for creating objects and for clients
and objects to communicate. Unlike traditional object-oriented
programming environments, these mechanisms
are independent of the applications that use object
services and of the programming languages used to
create the objects. COM therefore defines a binary
interoperability standard rather than a language-based
interoperability standard on any given operating system
and hardware platform.

To support its interoperability features, COM defines 
and implements mechanisms that allow components to 
connect to each other as objects. The definition of an 
object is a piece of software that contains the functions 
that represent what the object can do (its intelligence) 
and associated state information for those functions 
(data). In other words, an object is some data structure 
and some functions to manipulate that data. In this 
paper, we use the term object to mean an object 
instance, as opposed to an object class. An object class is 
similar to a derived-type in Fortran 90 or a structure in
C. It specifies a blueprint for object instances that a 
server will create upon a client's request. An important 
principle of object-oriented programming is encapsulation,
in which the exact implementation of those functions
and the exact format and layout of the data is only
of concern to the object itself. This information is hid- 
den from the clients of an object and can therefore be 
changed without affecting the client. 



With COM, components interact with each other
and with the system through collections of function
calls called interfaces. The function calls themselves are
also known as methods, member functions, or
requests. An interface is a semantically
related set of member functions. The interface as
a whole represents a feature of an object. The member
functions of an interface represent the operations that
make up the feature.

For a quick look at a simple example of a COM 
object, imagine a Calculator object that is willing to 
provide arithmetic services to any client. It could sup- 
port an interface named ICalculate. By convention, 
the letter I always prefixes the name of an interface. 
The ICalculate interface could contain member func- 
tions named Add, Subtract, Multiply, Divide, etc. If a 
client wanted to use the services of the Calculator 
object, it would request COM to create an object of 
class Calculator and request the ICalculate interface. It 
could then call the member functions of the ICalculate
interface (Add, Subtract, etc.).

With COM, a pointer to an object is actually a 
pointer to a particular interface that the object supports.
All COM objects support the interface named
IUnknown, which contains the member functions
named AddRef, Release, and QueryInterface. All COM
objects must implement these member functions.
AddRef and Release implement object reference 
counting. Clients use them to tell an object when they 
are using it and when they are done. Objects delete 
themselves when they are no longer being used by any 
client. QueryInterface is the basis for a process called
interface negotiation, whereby a client asks an object 
what services it is capable of providing. For example, 
if a client had a pointer to the Calculator object's 
IUnknown interface, it could get a pointer to its 
ICalculate interface by calling the IUnknown QueryInterface
member function. In general, an object can
support multiple interfaces and a client can use QueryInterface
to get a pointer to any of them. Examples in
which Fortran code calls member functions in inter- 
faces are given in the section Fortran Module Wizard 
Functionality. Microsoft defines a number of useful 
interfaces. Object class creators are free to use existing 
interfaces and define their own. 
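
The Calculator example and the three IUnknown member functions can be modeled in a few lines. This is a toy stand-in written for illustration, not real COM: interfaces are identified by name instead of by GUID, the object returns itself instead of a distinct interface pointer, and the `destroyed` flag only records the point at which a real object would free itself.

```python
class CalculatorObject:
    """Toy model of a COM object that supports IUnknown and ICalculate."""

    SUPPORTED_INTERFACES = ("IUnknown", "ICalculate")

    def __init__(self):
        self.ref_count = 0
        self.destroyed = False

    # --- IUnknown ---------------------------------------------------------
    def AddRef(self) -> int:
        self.ref_count += 1
        return self.ref_count

    def Release(self) -> int:
        self.ref_count -= 1
        if self.ref_count == 0:
            self.destroyed = True   # a real object would delete itself here
        return self.ref_count

    def QueryInterface(self, interface_name: str):
        """Interface negotiation: hand out the interface if it is supported."""
        if interface_name in self.SUPPORTED_INTERFACES:
            self.AddRef()           # each pointer handed out counts as a reference
            return self
        return None                 # real COM reports E_NOINTERFACE

    # --- ICalculate -------------------------------------------------------
    def Add(self, a, b):
        return a + b

    def Subtract(self, a, b):
        return a - b


calc = CalculatorObject()
calc.AddRef()                                  # client holds the IUnknown pointer
icalculate = calc.QueryInterface("ICalculate") # ask for the arithmetic feature
total = icalculate.Add(2, 3)
icalculate.Release()                           # done with ICalculate
calc.Release()                                 # done with the object
```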

Automation Objects One Microsoft-defined interface, 
IDispatch, is the basis for Automation. 7 Any object 
that supports this interface, also known as a dispinter- 
face, is an Automation object, and can be accessed by 
any Automation client. An Automation object exposes 
methods and properties. Methods are functions that 
perform an action on an object and are similar to the 
member functions of COM objects. Properties hold 
information about the state of an object. A property 
can be represented by a pair of methods: one for getting
the property's current value, and one for setting
the property's value.

The capabilities of an Automation object are similar
to those of a COM object. An Automation object is, in
fact, a COM object; that is, it supports the IUnknown
interface as well as the IDispatch interface. However,
the mechanisms for using the services of the two are
very different. Microsoft designed Automation based
on the needs of scripting or macro languages (for example,
Visual Basic). It does not require understanding the
intricacies of calling conventions as does COM. It supports
mechanisms more suitable to the dynamic querying
of an object's capabilities. This makes Automation
more suited to late binding of objects, that is, invoking
methods of a previously unknown object at run time.

An Automation client accesses all the methods and 
properties of an Automation object through a single 
member function of the IDispatch interface named 
Invoke. The client passes Invoke a number of argu- 
ments that identify 

■ The method, its arguments, and a place to receive 
the return value, or 

■ The property and its new value, or 

■ The property and a place to receive its current value 

In fact, Invoke could be described as the Swiss army 
knife of Automation programming. 
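
The single-entry-point style can be sketched as below. This is an illustrative model, not the IDispatch signature: the real Invoke identifies members by numeric dispatch identifiers and takes packed argument structures, whereas this sketch uses member names and plain Python arguments. The class, the constants, and the sample members are hypothetical.

```python
METHOD = "method"
PROPERTY_GET = "property_get"
PROPERTY_PUT = "property_put"


class AutomationObject:
    """Toy Automation object with one property and one method."""

    def __init__(self):
        self._properties = {"Visible": False}
        self._methods = {"Add": lambda a, b: a + b}

    def Invoke(self, member_name: str, kind: str, *args):
        """One entry point for method calls, property reads, and property writes."""
        if kind == METHOD:
            return self._methods[member_name](*args)
        if kind == PROPERTY_PUT:
            self._properties[member_name] = args[0]
            return None
        if kind == PROPERTY_GET:
            return self._properties[member_name]
        raise ValueError(f"unknown invocation kind: {kind}")


obj = AutomationObject()
result = obj.Invoke("Add", METHOD, 2, 3)
obj.Invoke("Visible", PROPERTY_PUT, True)
visible = obj.Invoke("Visible", PROPERTY_GET)
```

Because the member is chosen by a run-time value rather than by a compiled call, a client can drive an object it knew nothing about at build time, which is what makes late binding possible.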

Most of the differences between Automation objects
and COM objects are hidden by the Fortran interfaces
that the Wizard generates.

Object Identification To enable the use of COM objects
created by disparate groups of developers, there must
be a method of uniquely identifying an object class
regardless of its origin. COM uses globally unique
identifiers (GUIDs) to do this. A GUID is a 16-byte
integer value that is guaranteed (for all practical purposes)
to be unique across space and time. COM uses
GUIDs to identify object classes, interfaces, and other
things that require unique identification. COM provides
a routine named CoCreateGuid, and Microsoft
provides a utility named GUIDGEN, that a developer
uses to generate a GUID. Assigning a GUID to an
object class or interface is the job of the creator of the
class or interface. To create an instance of an object,
the developer needs to tell COM the GUID of the
object. Using 16-byte integers for identification is fine
for computers, but it poses a challenge for the typical
developer. COM supports the use of a less precise, textual
name called a programmatic identifier (ProgID).
A ProgID takes the form:

application_name.object_name.object_version

For example, the name of the Basic object of the
Microsoft Word application is Word.Basic.1. Similarly,
interfaces are usually discussed using their Ixxx name
(for example, IUnknown), but their GUID uniquely
identifies them. ProgIDs are not supplied for all objects.
They are normally supplied only for Application
objects. An Application object is a top-level object that
becomes active when the application starts. It provides
a starting point for clients to access all of an application's
subordinate objects.
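
The two kinds of identifier can be illustrated with a short sketch. Python's standard `uuid` module is used here as a stand-in for GUID generation (it is not the COM routine), and `parse_prog_id` is a hypothetical helper that splits a ProgID into the three fields named above, treating the version as optional.

```python
import uuid


def parse_prog_id(prog_id: str):
    """Split a ProgID into (application_name, object_name, object_version).

    The version is returned as None when the ProgID has only two fields.
    """
    parts = prog_id.split(".")
    if len(parts) == 2:
        application_name, object_name = parts
        return application_name, object_name, None
    if len(parts) == 3:
        application_name, object_name, object_version = parts
        return application_name, object_name, object_version
    raise ValueError(f"not a ProgID: {prog_id!r}")


# A freshly generated 128-bit identifier, in the same 16-byte format as a GUID.
guid = uuid.uuid4()
guid_size_in_bytes = len(guid.bytes)

fields = parse_prog_id("Word.Basic.1")
```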

Type Information Type information contains descrip- 
tions of object classes, interfaces, DLLs, data structures, 
and so forth that are independent of any program- 
ming language. A developer accesses type information 
through an interface named ITypeInfo. A client can
get a pointer to type information from 

■ A running Automation object 

■ A running COM object that supports the
IProvideClassInfo interface

■ A type library 

A type library is a collection of type information for
any number of object classes, interfaces, etc. A developer
can store a type library in a separate file (using a
.TLB extension by convention), or as part of another
file. For example, the type library that describes the
type information for a DLL can be stored in the .DLL
file itself. Since the type information is stored in a file, it
is available regardless of whether or not the client has a
pointer to the object(s) that the information describes.

The easiest way to create a type library is to write a
script in the Microsoft Interface Definition Language
(IDL). The Microsoft IDL compiler (MIDL) reads an
IDL script and creates a .TLB file. 10 An IDL script is similar
to a C++ header file with additional syntax for information
required by COM. An example of such information
is whether an argument to a member function is an input,
an output, or an input/output argument.

To use the Fortran Module Wizard, the developer
must know where to find type information for the functionality
to be used. Some examples of this are given in
the section Fortran Module Wizard Functionality.

Fortran 90 

This section describes features of the DIGITAL Fortran 
90 language that the Fortran Module Wizard uses in 
the code that it generates. 

Modules Fortran 90 does not support objects, but it
does provide a new form of program unit called a
module. A Fortran module is a set of declarations that
are grouped together under a global name and are
made available to other program units by means of the
Fortran USE statement. These modules have similarities
to C include files but are more powerful.

The Fortran Module Wizard generates a source file
containing one or more Fortran modules and places 
the following types of information in the modules: 

■ Derived-type definitions: Fortran equivalents of
data structures that are found in the type information.

■ Procedure interface definitions: Fortran interface
blocks that describe the procedures found in the
type information.

■ Procedure definitions: Fortran functions and subroutines
that are wrappers for the procedures found
in the type information. The wrappers make the
external procedures easier to call from Fortran by
handling data conversion and low-level invocation
details.

The use of modules allows the Fortran Module Wizard 
to encapsulate the data structures and procedures 
exposed by an object or DLL in a single place. These 
definitions can be shared in multiple Fortran programs. 

Attributes The DIGITAL Fortran 90 language supports
a number of calling convention attributes that
allow Fortran programs to call programs written in 
other programming languages. Some attributes select 
the calling convention (STDCALL, C, VARYING). 
Others determine whether an argument is passed by 
value or by reference (VALUE, REFERENCE). Another 
attribute defines the external name of the procedure 
(ALIAS). 

Pointer To Procedure The address of a COM member 
function is never known at program link time. The 
developer must get a pointer to an object's interface at 
run time, and the address of a particular member func- 
tion is computed from that. We have extended the 
DIGITAL Fortran 90 language to support a Pointer 
To procedure. 
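
The run-time lookup that motivates the Pointer To procedure extension can be pictured as follows. In COM, an interface pointer leads to a table of function pointers, and a member function is found by its position in that table. The sketch below models only that indirection; the helper names are hypothetical, and the table omits the three IUnknown entries that a real interface would carry first.

```python
def make_interface(member_functions):
    """Build a toy interface pointer: a one-element list that refers to the
    table of member functions."""
    function_table = tuple(member_functions)
    return [function_table]


def call_member(interface_pointer, slot, *args):
    """Locate a member function at run time from the interface pointer and call it."""
    function_table = interface_pointer[0]   # follow the interface pointer
    procedure = function_table[slot]        # the "address" is known only now
    return procedure(*args)


ADD_SLOT, SUBTRACT_SLOT = 0, 1
icalculate = make_interface([lambda a, b: a + b,
                             lambda a, b: a - b])
difference = call_member(icalculate, SUBTRACT_SLOT, 7, 2)
```

Nothing in `call_member` is resolved by a linker; the procedure to call is a value computed while the program runs, which standard Fortran 90 has no way to express.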

Microsoft Developer Studio 

Microsoft Developer Studio provides a number of 
methods that allow software developers to extend its 
environment. This section describes these methods.

Tools Menu Developer Studio contains a Customize 
dialog box through which the developer can add utili- 
ties to the Tools menu and then run those utilities 
from within Developer Studio. 

Gallery The Developer Studio Gallery provides a 
central repository for all reusable parts of projects. The 
reusable parts can range from something as simple as a 
bitmap to something as complex as a DLL. 

Developer Studio Object Model Developer Studio 
provides a set of COM objects that give developers 
programmatic control of its functionality. Users can 
create commands that perform specific tasks and add 
them to a toolbar. The Developer Studio Object 
Model is programmed in three ways: (1) by creating
macros in the Visual Basic Scripting Edition Language
(VBScript); (2) by creating a Developer Studio DLL
Add-in, which is a server implemented as a DLL; and
(3) by creating a separate Automation client that connects
to the Developer Studio objects.

Wizards A wizard is code that creates the starter
files for a new application or adds a feature to an
existing application. Wizards that add features are
stored in the Developer Studio Gallery. Wizards that
create starter files for a new application are called 
AppWizards. When the developer requests the cre- 
ation of a new project, Developer Studio presents a 
list of the types of project that can be created (for 
example, a console application or a DLL). In addi- 
tion, it lists the installed AppWizards that can gen- 
erate complete applications. Often they contain 
options that allow the developer to choose the fea- 
tures of a generated application. 

Microsoft Visual C++ provides a number of 
AppWizards; most of them can create typical C++ 
applications. In addition, to aid developers in extend- 
ing Developer Studio, one AppWizard creates the 
starter files for a custom AppWizard, and another
creates the starter files for a DLL Add-in. The Fortran 
Module Wizard is currently implemented as an appli- 
cation that runs from the Developer Studio Tools 
menu. In the future, it may be a Developer Studio
AppWizard.

Fortran Module Wizard Functionality 

This section describes the user interface of the Fortran
Module Wizard and presents some samples of the code 
generated by the Wizard. It also shows examples of 
calling the generated code from Fortran. 

User Interface 

Upon opening the Fortran Module Wizard from the 
Tools menu, the user is presented with a series of 
dialog boxes. From these, the user selects the type 
information for the functionality needed. 

Figure 2 shows the first dialog box. It requests the 
user to choose the source of the type information that
describes the required functionality. The developer 
must consult the documentation to determine what 
type of object (or DLL) the functionality is imple- 
mented as, and where to find its associated type infor- 
mation. The choices are the following: 

■ Automation object 

■ Type library containing automation information 

■ Type library containing COM interface information 

■ Type library containing DLL information 

■ DLL containing type information

[Figure 2 shows the Fortran Module Wizard dialog box: under "Select source of OLE type information," one option button for each of the five sources listed above; a check box labeled "Generate procedures to convert between Fortran and C strings"; a Module Name field; and a Next button.]

Figure 2

Fortran Module Wizard Dialog Box

Automation Object Microsoft recommends that servers 
provide a type library. Some applications, for example 
Microsoft Word version 7.0, do not, but they do 
provide type information dynamically when running. 
When this option is selected, Developer Studio dis- 
plays the dialog box shown in Figure 3. The user then 
enters the name of the application, the name of the 
object, and optionally the version number. Note that 
this method works only for objects that provide a 
ProgID. ProgIDs are entered into the system registry
and identify, among other things, the executable program
that is the object's server.

After the user enters the information and presses the 
Generate button, the Fortran Module Wizard asks 
COM to create an instance of the object identified by 
the ProgID that the Wizard constructs from the 
user-supplied information. COM starts the object's server if 
it needs to do so. The Wizard then asks the object for 
its type information and generates a file containing 
Fortran modules. 
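
As a rough illustration of the ProgID construction described above, the following Python sketch joins the application name, the object name, and the optional version number with periods. The helper name build_progid is hypothetical, and the joining rule is an assumption inferred from the two ProgIDs that appear in this paper (Word.Basic and PowerPoint.Application.7); the paper does not state the exact rule the Wizard applies.

```python
def build_progid(application, object_name, version=None):
    """Join the dialog-box fields into a ProgID string (assumed rule)."""
    parts = [application.strip(), object_name.strip()]
    # The version field is optional in the dialog box, so skip it when empty.
    if version is not None and str(version).strip():
        parts.append(str(version).strip())
    return ".".join(parts)


print(build_progid("Word", "Basic"))
print(build_progid("PowerPoint", "Application", 7))
```

With the three dialog-box fields filled in as shown, the sketch yields the same ProgID strings that the later examples pass to COMCreateObject and COMCLSIDFromPROGID.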

Other Options If the user chooses one of the remaining 
options, that is, any of the type libraries or the DLL 
(see Figure 2), Developer Studio displays the dialog 
box shown in Figure 4. From this dialog box, the user 
chooses the type library (or file containing the type 
library) and, optionally, the specific components of the 
type library. 






At the top of the dialog box, a combo box lists all 
the type libraries that have been registered with the 
system. Their file names have a number of different file 
extensions, for example, .OLB (object libraries) and 
.OCX (ActiveX controls). The user either selects a type 
library from the list or presses the Browse button to 
find the file using the standard Open dialog box. 
After selecting a type library, the user presses the 
Show button to list the interfaces described in the 
type library. By default, the Fortran Module Wizard 
uses all the interfaces; however, the developer can select 
the ones desired from the list. 

After the user enters the information and presses the 
Generate button, the Fortran Module Wizard asks 
COM to open the type library and generates a file 
containing Fortran modules. 

Generated Code 

The Fortran Module Wizard generates different code, 
depending upon the type of object or DLL described by 
the type information. Note that the generated code is a 
static representation of an object's type information. If 
the type information changes in a future release 
of the object, the Wizard must be run again. 

[Dialog box contents: fields for Application Name, Object Name, and Object Version; Generate and Cancel buttons.]

Figure 3 

Microsoft Developer Studio Dialog Box for Application Object Selection 

[Dialog box contents: a Type Information File Name combo box, an Interface(s) list, and a Show button.]

Figure 4 

Microsoft Developer Studio Dialog Box for Type Library Selection 

Fortran Run-time Support DIGITAL Visual Fortran 
provides a set of run-time routines that present to the 
Fortran programmer a higher-level abstraction of the 
IDispatch member functions and other COM functions. 
The routines are used in the code that the Wizard 
generates. They allow the programmer to perform the 
following tasks: 

■ Initialize the COM library. 

- COMInitialize initializes the COM library. 

- COMUninitialize uninitializes the COM library. 

■ Get an interface pointer of an object. 

- COMCreateObject takes a programmatic identifier 
or class identifier, creates an instance of 
an object, and returns a pointer to one of the object's 
interfaces. 

- COMGetActiveObject takes a programmatic 
identifier or class identifier and returns a 
pointer to an interface of a currently active object. 

- COMGetFileObject takes a file name and 
returns a pointer to the IDispatch interface of an 
Automation object that can manipulate the file. 

- COMCLSIDFromPROGID takes a programmatic 
identifier and returns the corresponding 
class identifier. 

- COMCLSIDFromString takes a class identifier 
string and returns the corresponding class 
identifier. 

■ Get or set the value of a property of an Automation 
object. 

- AUTOSetProperty takes the name or identifier 
of the property and a value, and it sets the value of 
the Automation object's property. 

- AUTOGetProperty takes the name or identifier 
of the property and gets the value of the 
Automation object's property. 

■ Invoke a method of an Automation object. 

- AUTOAllocateInvokeArgs allocates an argument 
list data structure that holds the arguments that 
the user will pass to AUTOInvoke. 

- AUTOAddArg takes an argument name and 
value and adds the argument to the argument 
list data structure. 

- AUTOInvoke takes the name or identifier of an 
object's method and an argument list data structure, 
and it invokes the method with the passed 
arguments. 

- AUTODeallocateInvokeArgs deallocates an argument 
list data structure. 

- AUTOGetExceptionInfo retrieves the exception 
information when a method has returned an 
exception status. 

■ Perform IUnknown interface member functions. 

- COMAddObjectReference adds a reference to an 
object's interface. 

- COMReleaseObject indicates that the program is 
done with a reference to an object's interface. 

- COMQueryInterface takes an interface identifier 
and returns a pointer to an object's interface. 



DIGITAL Visual Fortran provides three Fortran 
modules that define basic COM information: 

■ DFCOMTY defines basic COM types. 

■ DFCOM defines the interfaces to the DIGITAL 
Visual Fortran COM routines and to some COM 
system routines. 

■ DFAUTO defines the interfaces to the DIGITAL 
Visual Fortran Automation routines. 

Automation Objects Figure 5 contains code generated 
by the Fortran Module Wizard for the Word.Basic 
object of Microsoft Word version 7.0. Word.Basic is an 
Automation object with almost 1,000 methods. These 
methods represent the functionality of the Word Basic 
language, which is the programming interface to 
Microsoft Word. The Microsoft Word, Word Basic 
documentation contains information on the methods 
and their arguments. 12 We discuss some of the methods 
here in a simple example of Fortran code automating 
Word Basic to perform the task of replacing all the 
occurrences of a word in a document with another 
word. The Word.Basic methods of interest for this 
example are the following: 

■ AppShow makes the Microsoft Word application 
visible. 

■ FileOpen opens a document. 

■ EditReplace replaces a string with another string. 

■ FileSaveAs saves a document. 

Figure 5 contains code from the Fortran subroutine 
generated for the Word Basic FileOpen method. It 
is representative of the code generated for all 
Automation methods. The lines are annotated on the 
left side with numbers that are not part of the source 
code but correspond to the list below. Note that the 
naming convention used for the generated wrappers is 
objectname_methodname. Any periods in the name 
are replaced by underscores. 

1. If the type information provides a comment that 
describes the method, the comment is placed 
before the beginning of the procedure. 

2. The first argument to the procedure is always 
$OBJECT. It is a pointer to an Automation object's 
IDispatch interface. The last argument to the procedure 
is always $STATUS. This optional argument can 
be specified if the Fortran programmer wishes to 
examine the return status of the method. The 
IDispatch Invoke member function returns a status of 
type HRESULT, which is a 32-bit value. HRESULT 
has the same structure as a Win32 error code. In 
between the $OBJECT and $STATUS arguments 
are the method arguments' names determined from 
the type information. When the type information 
does not provide a name for an argument, the 
Fortran Module Wizard creates a $ARGn name. 

1-    !Opens an existing document or template
2-    SUBROUTINE Word_Basic_FileOpen($OBJECT, Name, ConfirmConversions, &
          ReadOnly, LinkToSource, AddToMru, PasswordDoc, PasswordDot, &
          Revert, WritePasswordDoc, WritePasswordDot, Connection, &
          SQLStatement, SQLStatement1, $STATUS)
      !DEC$ ATTRIBUTES DLLEXPORT :: Word_Basic_FileOpen
      IMPLICIT NONE
      INTEGER*4, INTENT(IN) :: $OBJECT ! Object Pointer
3-    !DEC$ ATTRIBUTES VALUE :: $OBJECT
4-    CHARACTER*(*), INTENT(IN), OPTIONAL :: Name ! BSTR
      !DEC$ ATTRIBUTES REFERENCE :: Name
      INTEGER*4, INTENT(OUT), OPTIONAL :: $STATUS ! Method status
      !DEC$ ATTRIBUTES REFERENCE :: $STATUS
      INTEGER*4 $$STATUS
      INTEGER*4 invokeargs
5-    invokeargs = AUTOALLOCATEINVOKEARGS()
6-    IF (PRESENT(Name)) CALL AUTOADDARG(invokeargs, 'Name', Name, &
          .FALSE., VT_BSTR)
7-    $$STATUS = AUTOINVOKE($OBJECT, 'FileOpen', invokeargs)
8-    IF (PRESENT($STATUS)) $STATUS = $$STATUS
9-    CALL AUTODEALLOCATEINVOKEARGS(invokeargs)
      END SUBROUTINE Word_Basic_FileOpen


Figure 5 

Representative Code Generated for Automation Methods 



3. This is an example of an attribute statement used to 
specify the calling convention of an argument. 

4. Methods can take optional arguments that must follow 
all the required arguments. In this method, 
there are no required arguments. The Fortran 
Module Wizard generates source lines for each 
argument using the data type and calling conventions 
found in the type information. 

5. AUTOAllocateInvokeArgs allocates a data structure 
that is used to collect the arguments that the programmer 
passes to the method. AUTOAddArg adds 
an argument to this data structure. 

6. For each optional argument, the Fortran PRESENT 
function is used to determine if the caller supplied 
the argument. If so, the argument is added to the 
argument list. 

7. AUTOInvoke invokes the named method passing 
the argument list. This returns a status result. 

8. If the caller supplied a status argument, the code 
copies the status result to it. 

9. AUTODeallocateInvokeArgs deallocates the memory 
used by the argument list data structure. 
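
The wrapper pattern in steps 5 through 9 can be sketched in Python as follows. This is only an illustration of the control flow, not the run-time library: FakeAutomationObject and word_basic_file_open are hypothetical names, a plain dictionary stands in for the argument list data structure, and a status of 0 is assumed to mean success.

```python
class FakeAutomationObject:
    """Hypothetical stand-in for an object reached through IDispatch."""

    def __init__(self):
        self.calls = []

    def invoke(self, method, named_args):
        # A real server would look the method up by name at run time.
        self.calls.append((method, dict(named_args)))
        return 0  # 0 plays the role of a success status here


def word_basic_file_open(obj, name=None, read_only=None, status=None):
    """Mimic the generated wrapper: collect supplied args, invoke, report."""
    invoke_args = {}                       # step 5: allocate the argument list
    if name is not None:                   # step 6: add only supplied arguments
        invoke_args["Name"] = name
    if read_only is not None:
        invoke_args["ReadOnly"] = read_only
    result = obj.invoke("FileOpen", invoke_args)   # step 7: invoke by name
    if status is not None:                 # step 8: copy status if requested
        status.append(result)
    invoke_args.clear()                    # step 9: release the argument list
    return result


word = FakeAutomationObject()
returned_status = []
word_basic_file_open(word, name="report.doc", status=returned_status)
print(word.calls, returned_status)
```

Only the arguments the caller actually supplies reach the server, which mirrors the use of the PRESENT function in the generated Fortran.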

Figure 6 shows code from a user-written Fortran 
program that invokes Microsoft Word to replace all 
the occurrences of a word in a document with another 
word. The example code is annotated with numbers 
that correspond to the following list. 



1. COMCreateObject requests COM to create an 
object with the ProgID Word.Basic. A pointer 
to the Word.Basic object's IDispatch interface is 
returned in wordapp. The IDispatch interface 
is returned with a reference count of 1. 

2. The code checks to ensure that an IDispatch pointer 
was returned. If not, it displays an error message and 
exits. The programmer can examine the status variable 
for the specific status return code. 

3. The code calls Word.Basic methods to show the 
Microsoft Word window, open the document, 
replace the string, and save the modified document. 

4. COMReleaseObject releases the single reference to 
the object's IDispatch interface so that Microsoft 
Word can terminate. 
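
The reference counting behind steps 1 and 4 can be sketched as below. The class and attribute names are hypothetical, and the rule that the server may stop once the count reaches zero is a simplification of what COM actually does when the last reference is released.

```python
class CountedInterface:
    """Hypothetical model of an interface pointer with a reference count."""

    def __init__(self):
        self.ref_count = 1          # a newly created interface starts at 1
        self.server_running = True

    def add_reference(self):
        self.ref_count += 1
        return self.ref_count

    def release(self):
        if self.ref_count == 0:
            raise RuntimeError("released more often than referenced")
        self.ref_count -= 1
        if self.ref_count == 0:
            # Last reference gone: the server is now free to terminate.
            self.server_running = False
        return self.ref_count


wordapp = CountedInterface()
wordapp.release()
print(wordapp.server_running)
```

In the Fortran example, the single call to COMReleaseObject plays the role of release() here, because the program never adds a second reference.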

      ! Create a Word object and make it visible
1-    CALL COMCREATEOBJECT("Word.Basic", wordapp, status)
2-    IF (wordapp == 0) THEN
          WRITE (*, '(" Unable to create Microsoft Word object; Aborting")')
          CALL EXIT(-1)
      END IF
3-    CALL Word_Basic_AppShow(wordapp, "", $STATUS = status)
      ! Open the document
      CALL Word_Basic_FileOpen(wordapp, filename, $STATUS = status)
      ! Replace all occurrences of the string
      CALL Word_Basic_EditReplace(wordapp, findstring, replacestring, &
          ReplaceAll=.TRUE., $STATUS=status)
      ! Save the file
      CALL Word_Basic_FileSaveAs(wordapp, filename, $STATUS = status)
      ! Release the Word.Basic object since we are done
4-    status = COMRELEASEOBJECT(wordapp)

Figure 6 

Code from a User-written Fortran Program That Invokes Microsoft Word 

COM Objects The Microsoft PowerPoint version 7.0 
type library contains a description of a number of COM 
objects and interfaces that make up the programmable 
interface to the Microsoft PowerPoint application. 
Figures 7 and 8 contain code generated by the Fortran 
Module Wizard from the Microsoft PowerPoint version 
7.0 type library. Unlike Microsoft Word, which provides 
a single object that presents all of Word's programmable 
functionality, PowerPoint provides a hierarchy of 
objects. The top-level object, Application, is identified by 
the ProgID PowerPoint.Application.7. The Application 
object contains member functions that return a pointer 
to subordinate objects, including the Presentations 
object. The Presentations object consists of a collection 
of Presentation objects. A Presentation contains a member 
function that returns a pointer to its SlideShow 
object, and so on. By navigating this hierarchy, the developer 
can obtain a pointer to a particular object's interface. 
A code example in which we use some of the PowerPoint 
objects and interfaces to run a slide presentation from 
PowerPoint is given later in this section. 

Figure 7 contains the interface description of the 
Presentations object's member function named Open. It 
is representative of the interfaces generated for all COM 
member functions. The procedure naming convention 
is objectname_memberfunctionname. The Open function 
opens an existing PowerPoint presentation. 



1. The first argument to the procedure is always 
$OBJECT. It is a pointer to the object's interface. 
The remaining argument names are determined 
from the type information. 

2. A BSTR is a length-prefixed string data type primarily 
for use by Automation objects. The wrappers 
generated for COM member functions convert 
from Fortran strings to BSTRs and vice versa. 

3. A VARIANT is a data structure that can contain any 
type of Automation data. It contains a field that 
identifies the type of data and a union that holds the 
data value. The use of a VARIANT argument allows 
the caller to use any data type that can be converted 
into the data type expected by the member function. 

4. Nearly every COM member function returns a status of 
type HRESULT. Therefore, if a COM member function 
produces output, it uses output arguments to 
return the values. In this example, the Open argument 
returns a pointer to a PowerPoint Presentation object. 

5. The interface of a COM member function looks 
similar to the interface for a DLL function with one 
major exception. Unlike a DLL function, the address 
of a COM member function is never known at program 
link time. To compute the address of a particular 
member function, the developer must get a pointer to 
an object's interface at run time. We have extended the 
DIGITAL Fortran 90 language to support a pointer 
to a procedure. Figure 8 shows an example of its use. 

      INTERFACE
        INTEGER*4 FUNCTION Presentations_Open($OBJECT, fileName, &
            ReadOnly, Untitled, WithWindow, Open)
          USE DFCOMTY
1-        INTEGER*4, INTENT(IN) :: $OBJECT ! Object Pointer
          !DEC$ ATTRIBUTES VALUE :: $OBJECT
2-        INTEGER*4, INTENT(IN) :: fileName ! BSTR
          !DEC$ ATTRIBUTES VALUE :: fileName
3-        TYPE (VARIANT), INTENT(IN) :: ReadOnly ! (Optional Arg)
          !DEC$ ATTRIBUTES VALUE :: ReadOnly
          TYPE (VARIANT), INTENT(IN) :: Untitled ! (Optional Arg)
          !DEC$ ATTRIBUTES VALUE :: Untitled
          TYPE (VARIANT), INTENT(IN) :: WithWindow ! (Optional Arg)
          !DEC$ ATTRIBUTES VALUE :: WithWindow
4-        INTEGER*4, INTENT(OUT) :: Open
          !DEC$ ATTRIBUTES REFERENCE :: Open
          !DEC$ ATTRIBUTES STDCALL :: Presentations_Open
        END FUNCTION Presentations_Open
      END INTERFACE
5-    POINTER(Presentations_Open_PTR, Presentations_Open)

Figure 7 

Code Generated by Fortran Module Wizard from Microsoft PowerPoint, Interface Description of Open Function 
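
The length-prefixed BSTR mentioned in the annotations for Figure 7 can be illustrated with a small Python sketch. The paper says only that a BSTR is length-prefixed; the details used here (a 4-byte byte count, UTF-16 character data, and a 2-byte terminator) follow the general Win32 BSTR layout, and the function names are hypothetical. The sketch does not describe how the Fortran conversion routine is implemented.

```python
import struct


def to_bstr_bytes(text):
    """Encode text as a length prefix, UTF-16 data, and a null terminator."""
    data = text.encode("utf-16-le")
    # The prefix counts the bytes of character data, not the terminator.
    return struct.pack("<I", len(data)) + data + b"\x00\x00"


def from_bstr_bytes(raw):
    """Decode by reading the prefix first, then exactly that many bytes."""
    (byte_count,) = struct.unpack_from("<I", raw, 0)
    return raw[4:4 + byte_count].decode("utf-16-le")


raw = to_bstr_bytes("demo.ppt")
print(len(raw), from_bstr_bytes(raw))
```

Because the length travels with the string, a receiver does not need to scan for a terminator, which is the main difference from a C string.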

Figure 8 contains the wrapper generated by the 
Fortran Module Wizard for the Open function. The 
name of a wrapper is the same as the name of the cor- 
responding member function, prefixed with a $. The 
numbers inserted at the left margin of the code exam- 
ple correspond to the following list. 

1. The wrapper takes the same argument names as the 
member function interface. 

2. Member function arguments of type BSTR are of 
type CHARACTER*(*) in the wrapper. 

3. The wrapper computes the address of the member 
function from the interface pointer and an offset 
found in the interface's type information. In implementation 
terms, the sequence is the following: an 
interface pointer points to a pointer to an array of function 
pointers called an Interface Function Table (see 
Figure 9). 

4. The wrapper declares a local variable to hold the 
BSTR to be passed to the member function. The next 
line does the conversion. 

5. Optional VARIANT arguments of a COM member 
function are represented by a VARIANT with distinguished 
values. OPTIONAL_VARIANT is defined 
in the DFCOMTY module with the distinguished 
values. 

6. The offset of the Open member function is 60. The 
code assigns the computed address to the function 
pointer Presentations_Open_PTR, which was 
declared in Figure 7, and then calls the function. 
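
The address computation in items 3 and 6 can be sketched in Python with a toy address space. Everything here is illustrative: the dictionary named memory, the addresses 0x1000 and 0x2000, and the function fake_open are hypothetical, and a pointer size of 4 bytes is assumed because the generated code holds pointers in INTEGER*4 variables.

```python
POINTER_SIZE = 4   # bytes; assumed from the INTEGER*4 pointers in the figures

memory = {}        # toy address space: address -> stored value


def install_interface(object_address, table_address, functions):
    """Lay out an interface pointer and its Interface Function Table."""
    memory[object_address] = table_address
    for slot, function in enumerate(functions):
        memory[table_address + slot * POINTER_SIZE] = function


def resolve_member(object_address, byte_offset):
    """Follow interface pointer -> table -> function, as the wrapper does."""
    table_address = memory[object_address]       # read the table's address
    return memory[table_address + byte_offset]   # read the entry at the offset


def fake_open(file_name):
    return "opened " + file_name


# Sixteen slots; slot 15 sits at byte offset 15 * 4 = 60.
slots = [None] * 15 + [fake_open]
install_interface(0x1000, 0x2000, slots)
open_function = resolve_member(0x1000, 60)
print(open_function("demo.ppt"))
```

The two dictionary lookups correspond to the two pointer dereferences in the wrapper: one to reach the table, and one to fetch the function address stored at the offset.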



1-    INTEGER*4 FUNCTION $Presentations_Open($OBJECT, fileName, &
          ReadOnly, Untitled, WithWindow, Open)
      !DEC$ ATTRIBUTES DLLEXPORT :: $Presentations_Open
      IMPLICIT NONE
      INTEGER*4, INTENT(IN) :: $OBJECT ! Object Pointer
      !DEC$ ATTRIBUTES VALUE :: $OBJECT
2-    CHARACTER*(*), INTENT(IN) :: fileName ! BSTR
      !DEC$ ATTRIBUTES REFERENCE :: fileName
      TYPE (VARIANT), INTENT(IN), OPTIONAL :: ReadOnly
      !DEC$ ATTRIBUTES REFERENCE :: ReadOnly
      TYPE (VARIANT), INTENT(IN), OPTIONAL :: Untitled
      !DEC$ ATTRIBUTES REFERENCE :: Untitled
      TYPE (VARIANT), INTENT(IN), OPTIONAL :: WithWindow
      !DEC$ ATTRIBUTES REFERENCE :: WithWindow
      INTEGER*4, INTENT(OUT) :: Open ! IDispatch
      !DEC$ ATTRIBUTES REFERENCE :: Open
      INTEGER*4 $RETURN
3-    INTEGER*4 $VTBL ! Interface Function Table
      POINTER($VPTR, $VTBL)
      TYPE (VARIANT) :: $VAR_ReadOnly
      TYPE (VARIANT) :: $VAR_Untitled
      TYPE (VARIANT) :: $VAR_WithWindow
4-    INTEGER*4 $BSTR_fileName ! BSTR
      $BSTR_fileName = ConvertStringToBSTR(fileName)
5-    IF (PRESENT(ReadOnly)) THEN
          $VAR_ReadOnly = ReadOnly
      ELSE
          $VAR_ReadOnly = OPTIONAL_VARIANT
      END IF
6-    $VPTR = $OBJECT ! Interface Function Table
      $VPTR = $VTBL + 60 ! Add routine table offset
      Presentations_Open_PTR = $VTBL
      $RETURN = Presentations_Open($OBJECT, $BSTR_fileName, &
          ReadOnly, Untitled, WithWindow, Open)
      $Presentations_Open = $RETURN
      END FUNCTION $Presentations_Open



Figure 8 

Code Generated by Fortran Module Wizard from Microsoft PowerPoint, Wrapper for Open Function 

[Diagram: INTERFACE POINTER -> POINTER -> INTERFACE FUNCTION TABLE (FUNCTION 1, FUNCTION 2, FUNCTION 3)]


Figure 9 

Interface Pointer to an Array of Function Pointers 



In fact, PowerPoint provides dual interfaces. A dual 
interface is a combination of an IDispatch interface 
and COM member functions. The IDispatch interface 
of the dual interface can be used by Automation 
clients, and the COM member functions can be used 
by COM clients. This means that for PowerPoint, and 
any server that provides dual interfaces, the Fortran 
developer can choose to generate a Fortran module 
for the Automation interfaces or the COM interfaces. 
The Fortran interfaces generated by the Wizard likely 
will not be much different. COM interfaces typically 
provide better performance since there is less overhead 
in invoking COM member functions than 
dispinterface methods through the IDispatch Invoke 
member function. 

Figure 10 shows code from a user-written Fortran 
program that invokes PowerPoint to run a slide pre- 
sentation. The code example is annotated with num- 
bers that correspond to the following list. 



1. COMCLSIDFromPROGID and COMCreateObject 
request COM to create an object with the ProgID 
PowerPoint.Application.7, and to return a pointer 
to the object's IApplication interface. 

2. The code gets the AppWindow object from the 
Application object and calls its Visible member 
function to make PowerPoint visible. 

3. The code gets the Presentations object from the 
Application object and calls its Open member 
function to open a Presentation. Note that three 
of the arguments to Open are of the VARIANT 
data type. The code sets them to the values true 
and false. 

4. The code gets the SlideShow object from the 
Presentation object and calls its Run member func- 
tion to run the slide show. 
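
The VARIANT values set in step 3 can be pictured with a simplified Python model built on the ctypes module. This is a sketch under stated assumptions: the layout below keeps only a type tag and a small union and omits the reserved fields of the real structure, and the class and field names are hypothetical. The numeric constants are the standard Automation values (VT_I4 is 3, VT_BOOL is 11, and a true Boolean is stored as -1).

```python
import ctypes

VT_I4 = 3            # 32-bit integer type tag
VT_BOOL = 11         # Boolean type tag
VARIANT_TRUE = -1    # Automation stores true as all bits set
VARIANT_FALSE = 0


class VariantValue(ctypes.Union):
    """Union that holds the data value (only three members shown)."""
    _fields_ = [("bool_val", ctypes.c_int16),
                ("long_val", ctypes.c_int32),
                ("double_val", ctypes.c_double)]


class SimpleVariant(ctypes.Structure):
    """Simplified VARIANT: a type tag followed by the value union."""
    _fields_ = [("vt", ctypes.c_uint16),
                ("value", VariantValue)]


def make_bool(flag):
    variant = SimpleVariant()
    variant.vt = VT_BOOL
    variant.value.bool_val = VARIANT_TRUE if flag else VARIANT_FALSE
    return variant


def read_variant(variant):
    # The tag tells the reader which union member is meaningful.
    if variant.vt == VT_BOOL:
        return variant.value.bool_val != VARIANT_FALSE
    if variant.vt == VT_I4:
        return variant.value.long_val
    raise ValueError("unsupported type tag: %d" % variant.vt)


v_true = make_bool(True)
v_false = make_bool(False)
print(read_variant(v_true), read_variant(v_false))
```

The Fortran code in Figure 10 does the same two assignments per value: one to the type field and one to the Boolean member of the union.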

DLLs When the Fortran Module Wizard reads the 
type information describing a DLL, it generates an 
interface description for each function in the DLL. It 
also generates Fortran derived types for data structures 
defined in the DLL type information. This 
relieves the Fortran developer from manually translating 
header file descriptions to Fortran descriptions. 
The Wizard also provides the option of generating 
wrappers that convert from the Fortran representation 
of strings to the C representation of strings and vice 
versa. This option can be selected from the Wizard's 
initial dialog box (see Figure 2). 
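
The kind of string conversion such wrappers perform can be sketched in Python. The paper does not describe the conversion itself, so this sketch relies on the usual conventions as assumptions: a Fortran character variable has a fixed length and is padded with trailing blanks, while a C string ends at the first null byte. The function names are hypothetical, and trimming trailing blanks is a design choice that a real wrapper might not make.

```python
def fortran_to_c(fortran_text):
    """Blank-padded fixed-length text -> null-terminated bytes."""
    return fortran_text.rstrip(" ").encode("ascii") + b"\x00"


def c_to_fortran(c_bytes, length):
    """Null-terminated bytes -> text blank-padded or truncated to length."""
    text = c_bytes.split(b"\x00", 1)[0].decode("ascii")
    return text[:length].ljust(length)


print(fortran_to_c("abc   "))
print(repr(c_to_fortran(b"abc\x00", 6)))
```

Going from C to Fortran needs the length of the destination variable, because a Fortran string carries no terminator.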



      ! Create a PowerPoint Application object
      ! and make the AppWindow visible
1-    CALL COMCLSIDFROMPROGID("PowerPoint.Application.7", &
          clsid, status)
      CALL COMCREATEOBJECT(clsid, CLSCTX_SERVER, IID_Application, &
          ppApplication, status)
      IF (ppApplication == 0) THEN
          WRITE (*, '(" Unable to create PowerPoint object; Aborting")')
          CALL EXIT(-1)
      END IF
2-    status = $Application_GetAppWindow(ppApplication, ppAppWindow)
      status = $ApplicationWindow_SetVisible(ppAppWindow, 1)
      ! Open the specified presentation
3-    status = $Application_GetPresentations(ppApplication, &
          ppPresentations)
      vTrue%VT = VT_BOOL
      vTrue%VU%BOOL_VAL = VARIANT_BOOL_TRUE
      vFalse%VT = VT_BOOL
      vFalse%VU%BOOL_VAL = VARIANT_BOOL_FALSE
      status = $Presentations_Open(ppPresentations, filename, &
          vTrue, vFalse, vTrue, ppPresentation)
      ! Run the slide show
4-    status = $Presentation_GetSlideShow(ppPresentation, ppSlideShow)
      status = $SlideShow_Run(ppSlideShow, 1, ppRun)



Figure 10 

Fortran Program to Invoke PowerPoint to Run Slide Presentation 

Comparison of the Wizard to the Capabilities of 
Other Languages 

Visual C++ version 5.0, Visual J++ version 1.1, and 
Visual Basic version 5.0 all have wizards that can read a 
type library and allow applications to use COM 
and/or Automation objects. 

The Visual C++ ClassWizard can read a type library 
and create a class with all the functions of the 
IDispatch interface described in the library. Visual C++ 
version 5.0 also adds a preprocessor directive, 
#import. The #import directive reads a type library 
and generates two header files that contain the definitions 
of the COM objects defined in the type library. 

The Java Type Library Wizard within Visual J++ 
invokes the JavaTLB utility to convert the information 
in a type library into Java .class files. A Java .class file is 
the binary form of a Java class or interface. 

To use an object defined in a type library from 
Visual Basic, the developer must add a reference to the 
object using the Project menu, References command. 
The References dialog box allows the user to select 
from the list of registered type libraries in a manner 
similar to the Fortran Module Wizard. 

The Fortran Module Wizard is unique in the following 
ways. The Fortran 90 programming language 
does not inherently support objects. The Fortran 
Module Wizard employs a combination of language 
and run-time support to provide this capability. The 
supporting language features are modules and procedure 
pointers. The supporting run-time modules are 
DFCOMTY, DFCOM, and DFAUTO. The Fortran 
Module Wizard provides support for type libraries 
containing the descriptions of DLL routines. 

Fortran Module Wizard Architecture 

The architecture of the Fortran Module Wizard is fairly 
simple. The shell of the Wizard was generated by the 
Custom AppWizard within Visual C++. The inner 
workings of the Wizard consist of three major pieces: 

■ Type information reader 

■ Type symbol table 

■ Fortran code generator 

Figure 11 shows a high-level data flow of the 
Fortran Module Wizard. The type information reader 
traverses the data structures in the type information 
and creates the type symbol table. The Win32 SDK 
provides a sample application, the BROWSE OLE 
sample, that is an example of traversing the information 
in a type library. The type symbol table is a symbol 
table similar to those used by compilers. It maps type 
names to the descriptions of types. For simplicity, the 
information is stored using the same data structures 
used by the type information. The Fortran code generator 
traverses the symbol table and generates a 
Fortran module. 

The use of a symbol table allows for a complete 
separation of the functionality of the type information 
reader from the Fortran code generator. A code generator 
for another programming language could be 
easily substituted, as could another source of type 
information (for example, a C header file). 
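
The separation described above can be sketched in a few lines of Python. This is not the Wizard's code: the function names, the dictionary-based symbol table, and the shape of the input records are all hypothetical, and the emitted Fortran text is only a skeleton chosen to show that the generator depends on the symbol table alone.

```python
def read_type_information(type_info):
    """Reader: walk the type descriptions and build a symbol table."""
    symbol_table = {}
    for entry in type_info:
        symbol_table[entry["name"]] = entry   # maps a name to its description
    return symbol_table


def generate_fortran(symbol_table, module_name):
    """Generator: walk the symbol table and emit a Fortran module skeleton."""
    lines = ["MODULE " + module_name, "  INTERFACE"]
    for name in sorted(symbol_table):
        args = ", ".join(symbol_table[name]["args"])
        lines.append("    SUBROUTINE %s(%s)" % (name, args))
        lines.append("    END SUBROUTINE %s" % name)
    lines += ["  END INTERFACE", "END MODULE " + module_name]
    return "\n".join(lines)


def generate_summary(symbol_table):
    """A second, substitutable generator that reads the same table."""
    return ["%s/%d" % (name, len(symbol_table[name]["args"]))
            for name in sorted(symbol_table)]


table = read_type_information([{"name": "Beep", "args": ["freq", "duration"]}])
print(generate_fortran(table, "DEMO"))
print(generate_summary(table))
```

Swapping generate_fortran for generate_summary requires no change to the reader, which is the property the symbol table is meant to provide.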

Figure 11 

Data Flow of the Fortran Module Wizard 

Future Directions 

There are a number of possibilities for future work that 
would add to the capabilities provided by the Fortran 
Module Wizard. 

■ Fortran support for ActiveX controls. An ActiveX 
control is an Automation object. It is a reusable 
component that normally provides a user interface 
and is used in dialog boxes and other windows. The 
Fortran Module Wizard can generate a module 
that would allow a Fortran developer to use the 
methods and properties of an ActiveX control. 
However, additional functionality would be needed 
in the Fortran run-time libraries to make controls 
usable from a Fortran application. A control has 
to be placed in a special type of window called a 
Control Container. The Fortran run-time libraries 
do not currently contain support for a Control 
Container. In addition to methods and properties, 
a control can define events. An event allows a control 
to notify its container when something of interest 
happens to the control. For example, a Button 
control could define a Clicked event. 

■ Fortran Windows Application Wizard. This Wizard 
could generate starter files for a Fortran Windows 
application. This would be especially useful if we 
were to implement the Fortran support for ActiveX 
controls. 

■ Fortran modules from C header files. By replacing 
the type information reader described in the previous 
section with a C parser, we could generate 
Fortran modules directly from .h files. This would 
expand the set of services that are easily available to 
Fortran developers. 

■ Fortran Server Wizard. This Wizard would take a 
Fortran module provided by a Fortran developer 
and package it as a COM object. It would also generate 
a type library that describes the object. This 
object could then be used by any COM client, for 
example, Visual Basic, Visual C++, and Visual J++ 
applications. 

The MEMORY CHANNEL network is a dedicated 
cluster interconnect that provides virtual shared 
memory among nodes by means of internodal 
address space mapping. The interconnect imple- 
ments direct user-level messaging and guaran- 
tees strict message ordering under all conditions, 
including transmission errors. These character- 
istics allow industry-standard communication 
interfaces and parallel programming paradigms 
to achieve much higher efficiency than on con- 
ventional networks. This paper presents an 
overview of the MEMORY CHANNEL network 
architecture and describes DIGITAL'S crossbar- 
based implementation of the second-generation 
MEMORY CHANNEL network, MEMORY CHANNEL 2. 
This network provides bisection bandwidths 
of 1,000 to 2,000 megabytes per second and a 
sustained process-to-process bandwidth of 
88 megabytes per second. One-way, process- 
to-process message latency is less than 2.2 
microseconds. 



In computing, a cluster is loosely defined as a parallel 
system comprising a collection of stand-alone comput- 
ers (each called a node) connected by a network. Each 
node runs its own copy of the operating system, and 
cluster software coordinating the entire parallel system 
attempts to provide users with a unified system view. 
Since each node in the cluster is an off-the-shelf 
computer system, clusters offer several advantages 
over traditional massively parallel processors (MPPs) 
and large-scale symmetric multiprocessors (SMPs). 
Specifically, clusters provide 1 

■ Much better price/performance ratios, opening a 
wide range of computing possibilities for users who 
could not otherwise afford a single large system. 

■ Much better availability. With appropriate software 
support, clusters can survive node failures, whereas 
SMP and MPP systems generally do not. 

■ Impressive scaling (hundreds of processors), when 
the individual nodes are medium-scale SMP systems. 

■ Easy and economical upgrading and technology 
migration. Users can simply attach the latest- 
generation node to the existing cluster network. 

Despite their advantages and their impressive peak 
computational power, clusters have been unable to 
displace traditional parallel systems in the marketplace 
because their effective performance on many real- 
world parallel applications has often been disappoint- 
ing. Clusters' lack of computational efficiency can be 
attributed to their traditionally poor communication, 
which is a result of the use of standard networking 
technology as a cluster interconnect. The develop- 
ment of the MEMORY CHANNEL network as a cluster 
interconnect was motivated by the realization that the 
gap in effective performance between clusters and 
SMPs can be bridged by designing a communication 
network to deliver low latency and high bandwidth all 
the way to the user applications. 

Over the years, many researchers have recognized 
that the performance of the majority of real-world par- 
allel applications is affected by the latency and band- 
width available for communication. 2 "* In particular, 
it has been shown 20 - 7 that the efficiency of parallel 
scientific applications is strongly influenced by the 
system's architectural balance as quantified by its 
communication-to-computation ratio, which is sometimes 
called the q-ratio. 2 The q-ratio is defined as 
the ratio between the time it takes to send an 8-byte 
floating-point result from one process to another 
(communication) and the time it takes to perform a 
floating-point operation (computation). In a system 
with a q-ratio equal to 1, it takes the same time for a 
node to compute a result as it does for the node to 
communicate the result to another node in the system. 
Thus, the higher the q-ratio, the more difficult it is to 
program a parallel system to achieve a given level of 
performance. Q-ratios close to unity have been 
obtained only in experimental machines, such as 
iWarp and the M-Machine, by employing direct 
register-based communication. 
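
The definition of the q-ratio reduces to a single division, sketched here in Python. The function name and the timing values are hypothetical and were chosen only to make the arithmetic easy to follow; they are not measurements of any system discussed in this paper.

```python
def q_ratio(send_time_seconds, flop_time_seconds):
    """Time to communicate one 8-byte result over time for one operation."""
    return send_time_seconds / flop_time_seconds


# Hypothetical node: 5 microseconds to deliver one result to another
# process, 10 nanoseconds per floating-point operation.
print(round(q_ratio(5e-6, 10e-9)))

# A balanced system: communicating a result costs as much as computing it.
print(q_ratio(2.0, 2.0))
```

With these example numbers, a process could perform about 500 floating-point operations in the time it takes to send a single result, which is why a high q-ratio makes efficient parallel programming harder.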

Table 1 shows actual q-ratios for several commercial
systems. These q-ratios vary from about 100 for a
DIGITAL AlphaServer 4100 SMP system using shared
memory to 30,000 for a cluster of these SMP systems
interconnected over a fiber distributed data interface
(FDDI) network using the transmission control
protocol/internet protocol (TCP/IP). An MPP
system, such as the IBM SP2, using the Message
Passing Interface (MPI) has a q-ratio of 5,714. The
MEMORY CHANNEL network developed by Digital
Equipment Corporation reduces the q-ratio of an
AlphaServer-based cluster by a factor of 28 to 82 to be
within the range of 367 to 1,067. Q-ratios in this
range permit clusters to efficiently tackle a large class
of parallel technical and commercial problems.
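
The q-ratios quoted for the AlphaServer configurations follow directly from the definition. The short Python sketch below recomputes them from the latency and LINPACK figures listed in Table 1; the function and variable names are illustrative and do not come from the original work.

```python
def q_ratio(comm_latency_us, us_per_flop):
    """Communication-to-computation ratio: time to send one 8-byte result
    divided by the time to perform one floating-point operation."""
    return comm_latency_us / us_per_flop

# Computation performance of the AlphaServer 4100 Model 300 (Table 1)
US_PER_FLOP = 0.006

# Communication latency in microseconds (Table 1)
latency_us = {
    "SMP, shared memory messaging": 0.6,
    "SMP, MPI": 3.4,
    "FDDI cluster, TCP/IP": 180.0,
    "MEMORY CHANNEL, native messaging": 2.2,
    "MEMORY CHANNEL, MPI": 6.4,
}

ratios = {name: round(q_ratio(lat, US_PER_FLOP))
          for name, lat in latency_us.items()}

for name, q in ratios.items():
    print(f"{name:34s} q = {q:,}")

# Improvement of the MEMORY CHANNEL cluster over the FDDI cluster
fddi = ratios["FDDI cluster, TCP/IP"]
reduction_mpi = fddi / ratios["MEMORY CHANNEL, MPI"]
reduction_native = fddi / ratios["MEMORY CHANNEL, native messaging"]
print(round(reduction_mpi), round(reduction_native))
```

Because every AlphaServer row shares the same computation figure, the q-ratio in these rows tracks communication latency alone.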

The benefits of low-latency, high-bandwidth
networks are well understood. As shown by many
studies, high communication latency over traditional
networks is the result of the operating system
overhead involved in transmitting and receiving messages.
The MEMORY CHANNEL network eliminates
this latency by supporting direct process-to-process
communication that bypasses the operating system.



The MEMORY CHANNEL network supports this type 
of communication by implementing a natural exten- 
sion of the virtual memory space, which provides 
direct, but protected, access to the memory residing in 
other nodes. 

Based on this approach, DIGITAL developed
its first-generation MEMORY CHANNEL network
(MEMORY CHANNEL 1), which has been shipping
in production since April 1996. The network does not
require any functionality beyond the peripheral component
interconnect (PCI) bus and therefore can be
used on any system with a PCI I/O slot. DIGITAL
currently supports production MEMORY CHANNEL
clusters as large as 8 nodes by 12 processors per node
(a total of 96 processors). One of these clusters was
presented at Supercomputing '95 and ran clusterwide
applications using High Performance Fortran (HPF),
Parallel Virtual Machine (PVM), and MPI in
DIGITAL's Parallel Software Environment (PSE). This
96-processor system has a q-ratio of 500 to 1,000,
depending on the communication interface. A 4-node
MEMORY CHANNEL cluster running DIGITAL
TruCluster software and the Oracle Parallel Server
has held the cluster performance world record on the
TPC-C benchmark (the industry standard in on-line
transaction processing) since April 1996.

We next present an overview of the generic 
MEMORY CHANNEL network to justify the design 
goals of the second-generation MEMORY CHANNEL 
network (MEMORY CHANNEL 2). Following this 
overview, we describe in detail the architecture of 
the two components that make up the MEMORY 
CHANNEL 2 network: the hub and the adapter. Last, 
we present hardware-measured performance data.

MEMORY CHANNEL Overview 

The MEMORY CHANNEL network is a dedicated
cluster interconnection network, based on Encore's
MEMORY CHANNEL technology, that supports
virtual shared memory space by means of internodal
memory address space mapping, similar to that used
in the SHRIMP system. The MEMORY CHANNEL
substrate is a flat, fully interconnected network
that provides push-only message-based communication.
Unlike traditional networks, the MEMORY
CHANNEL network provides low-latency communication
by supporting direct user access to the network.
As in Scalable Coherent Interface (SCI) [23] and Myrinet [24]
networks, connections between nodes are established
by mapping part of the nodes' virtual address space to
the MEMORY CHANNEL interface.

Table 1
Comparison of Communication and Computation Performance (q-ratio) for Various Parallel Systems

                                                  Communication    Computation            Communication-
                                                  Performance      Performance Based on   to-computation
                                                  Latency          LINPACK 100x100        Ratio
System                                            (Microseconds)   (Microseconds/FLOP)    (q-ratio)

AlphaServer 4100 Model 300 configurations
  SMP using shared memory messaging                    0.6              0.006                  100
  SMP using MPI                                        3.4              0.006                  567
  FDDI cluster using TCP/IP                          180.0              0.006               30,000
  MEMORY CHANNEL cluster using native messaging        2.2              0.006                  367
  MEMORY CHANNEL cluster using MPI                     6.4              0.006                1,067
IBM SP2 using MPI                                     40.0              0.006                5,714

A MEMORY CHANNEL connection can be opened 
as either an outgoing connection (in which case an 
address-to-destination node mapping must be pro- 
vided) or an incoming connection. Before a pair of 
nodes can communicate by means of the MEMORY 
CHANNEL network, they must consent to share part 
of their address space — one side as outgoing and the 
other as incoming. The MEMORY CHANNEL net- 
work has no storage of its own. The granularity of the 
mapping is the same as the operating system page size. 

MEMORY CHANNEL Address Space Mapping 

Mapping is accomplished through manipulation of 
page tables. Each node that maps a page as incoming 
allocates a single page of physical memory and makes 
it available to be shared by the cluster. The page is 
always resident and is shared by all processes in the 
node that map the page. The first map of the page
causes the memory allocation, and subsequent
reads/maps point to the same page. No memory is
allocated for pages mapped as outgoing. The mapper
simply assigns the page table entry to a portion of the
MEMORY CHANNEL hardware transmit window and
defines the destination node for that transmit subspace.
Thus, the amount of physical memory consumed
for the clusterwide network is the product of
the operating system page size and the total number
of pages mapped as incoming on each node.

After mapping, MEMORY CHANNEL accesses are 
accomplished by simple load and store instructions, as 
for any other portion of virtual memory, without any 
operating system or run-time library calls. A store 
instruction to a MEMORY CHANNEL outgoing 
address results in data being transferred across the 
MEMORY CHANNEL network to the memory allo- 
cated on the destination node. A load instruction from 
a MEMORY CHANNEL incoming channel address 
space results in a read from the local physical memory 
initialized as a MEMORY CHANNEL incoming chan- 
nel. The overhead (in CPU cycles) in establishing a 
MEMORY CHANNEL connection is much higher than 
that of using the connection. Because of the memory- 
mapped nature of the interface, the transmit or receive 
overhead is similar to an access to local main memory. 
This mechanism is the fundamental reason for the low 
MEMORY CHANNEL latency. Figure 1 illustrates an
example of MEMORY CHANNEL address mapping. 

The figure shows two sets of independent connec- 
tions. Node 1 has established an outgoing channel to 
node 3 and node 4 and also an incoming channel 
to itself. Node 4 has an outgoing channel to node 2. 




Figure 1 

MEMORY CHANNEL Mapping of a Portion of the Clusterwide Address Space

All connections are unidirectional, either outgoing
or incoming. To map a channel as both outgoing and
incoming to the same shared address space, node 1
maps the channel two times into a single process's
virtual address space. The mapping example in Figure 1
requires a total of four pages of physical memory, one
for each of the four arrows pointed toward the nodes'
virtual address spaces.
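
The mapping rules and the Figure 1 example can be mimicked with a toy software model. The Python sketch below is only an illustration of the semantics described above: the `Node` class, the channel names "A" and "B", and the 8-KB page size are assumptions made for the example, and the real mechanism is implemented in page tables and adapter hardware rather than in software objects.

```python
PAGE_SIZE = 8192  # bytes; assumed page size, not stated in the text

class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.incoming = {}  # channel name -> locally allocated page
        self.outgoing = {}  # channel name -> list of destination nodes

    def map_incoming(self, channel):
        # The first map allocates one resident page; later maps reuse it.
        return self.incoming.setdefault(channel, bytearray(PAGE_SIZE))

    def map_outgoing(self, channel, destinations):
        # No memory is allocated; only the destinations are recorded.
        self.outgoing[channel] = list(destinations)

    def store(self, channel, offset, value):
        # A store to an outgoing address lands in the page that each
        # destination node mapped as incoming.
        for dest in self.outgoing[channel]:
            dest.incoming[channel][offset] = value

    def load(self, channel, offset):
        # A load from an incoming address reads local memory only.
        return self.incoming[channel][offset]

n1, n2, n3, n4 = (Node(i) for i in range(1, 5))

# First connection set: node 1 transmits to nodes 3 and 4 and to itself.
for n in (n1, n3, n4):
    n.map_incoming("A")
n1.map_outgoing("A", [n1, n3, n4])

# Second connection set: node 4 transmits to node 2.
n2.map_incoming("B")
n4.map_outgoing("B", [n2])

n1.store("A", 0, 42)
n4.store("B", 0, 7)

# Physical memory consumed = page size x number of incoming pages.
physical_bytes = sum(len(page) for n in (n1, n2, n3, n4)
                     for page in n.incoming.values())
print(n3.load("A", 0), n2.load("B", 0), physical_bytes // PAGE_SIZE)
```

In this model only incoming mappings allocate memory, so the four incoming pages account for all the physical memory the example consumes.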

MEMORY CHANNEL mappings reside in two page 
control tables (PCTs) located on the MEMORY 
CHANNEL interface, one on the sender side and one 
on the receiver side. As shown in Figure 2, each page 
entry in the PCT has a set of attributes that specify 
the MEMORY CHANNEL behavior for that page. 

The page attributes on the sender side are 

■ Transmit enabled, which must be set to allow trans- 
mission from store instructions to a specific page 

■ Local copy on transmit, which directs an ordered 
copy of the transmitted packet to the local memory 

■ Acknowledge request, which is used to request 
acknowledgments from the receiver node 

■ Transmit enabled under error, which is used in 
error recovery communication 

■ Broadcast or point-to-point, which defines whether
a packet is sent to all nodes or to a single node
in the cluster

■ Request acknowledge, which requests a reception 
acknowledgment from the receiver 

The page attributes on the receiver side are 

■ Receive enabled, which must be set to allow recep- 
tion of messages addressed to a specific virtual page 

■ Interrupt on receive, which generates an interrupt 
on reception of a packet 

■ Receive enabled under error, which is asserted for 
error recovery communication pages 

■ Remote read, which identifies all packets that arrive 
at a page as requests for a remote read operation 

■ Conditional write, which identifies all packets that 
arrive at a page as conditional write packets 



MEMORY CHANNEL Ordering Rules 

The MEMORY CHANNEL communication paradigm 
is based on three fundamental ordering rules: 

1. Single-sender Rule: All destination nodes will
receive packets in the order in which they were gen- 
erated by the sender. 

2. Multisender Rule: Packets from multiple sender 
nodes will be received in the same order at all desti- 
nation nodes. 

3. Ordering-under-errors Rule: Rules 1 and 2 must 
apply even when an error occurs in the network. 

Let $Pj_{M \to X}$ be the jth point-to-point packet from
a sender node M to a destination node X, and let $Bj_{M}$
be the jth broadcast packet from node M to all other
nodes. If node M sends the following sequence of
packets,

$P2_{M \to X},\ P1_{M \to Y},\ B1_{M},\ P1_{M \to X}$
(last) ... (first)

Rule 1 dictates that nodes X and Y will receive the
packets in the following order:

at node X, $P2_{M \to X},\ B1_{M},\ P1_{M \to X}$
(last) ... (first)

at node Y, $P1_{M \to Y},\ B1_{M}$
(last) ... (first)

If a node N is also sending a sequence of packets, in
the following order,

$P3_{N \to X},\ P2_{N \to X},\ B2_{N},\ P2_{N \to Y},\ B1_{N},\ P1_{N \to Y},\ P1_{N \to X}$
(last) ... (first)

there is a finite set of valid reception orders at destination
nodes X and Y, depending on the actual arrival
time of the requests to the point of global ordering.
Rule 1 dictates that all packets from node M (or N) to
node X (or Y) must arrive at node X (or Y) in the order
in which they were transmitted. Rule 2 dictates that,
regardless of the relative order among the senders,
messages destined to both receivers must be received
in the same order. For example, if X receives $B2_{N}$, $B1_{M}$,
and $B1_{N}$, then Y should receive these packets in the
same order. One arrival order congruent with both of
these rules is the following:

at node X,

$P3_{N \to X},\ P2_{N \to X},\ P2_{M \to X},\ B2_{N},\ B1_{M},\ B1_{N},\ P1_{N \to X},\ P1_{M \to X}$
(last) ... (first)

at node Y,

$B2_{N},\ P2_{N \to Y},\ P1_{M \to Y},\ B1_{M},\ B1_{N},\ P1_{N \to Y}$
(last) ... (first)

These rules are independent of a particular intercon- 
nection topology or implementation and must be 
obeyed in all generations of the MEMORY CHANNEL 
network. 
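
The first two ordering rules can be phrased as checks over per-node arrival lists. The Python sketch below is a hedged illustration, not part of the MEMORY CHANNEL software: the helper names are invented, and the packet sequences are one reading of the example above, written here from first to last (the text lists them from last to first).

```python
from itertools import combinations

def pkt(name, sender, dest=None):
    """Build a packet record; dest=None marks a broadcast packet."""
    return (name, sender, dest)

def rule1_ok(sent, received):
    """Single-sender rule: each node sees every sender's packets in
    the order in which that sender transmitted them."""
    for node, arrivals in received.items():
        for sender, stream in sent.items():
            expected = [p for p in stream if p[2] in (None, node)]
            actual = [p for p in arrivals if p[1] == sender]
            if expected != actual:
                return False
    return True

def rule2_ok(received):
    """Multisender rule: packets delivered to several nodes appear in
    the same relative order at each of those nodes."""
    for a, b in combinations(received, 2):
        common = set(received[a]) & set(received[b])
        if ([p for p in received[a] if p in common]
                != [p for p in received[b] if p in common]):
            return False
    return True

P1MX, P2MX, P1MY = pkt("P1", "M", "X"), pkt("P2", "M", "X"), pkt("P1", "M", "Y")
B1M = pkt("B1", "M")
P1NX, P2NX, P3NX = pkt("P1", "N", "X"), pkt("P2", "N", "X"), pkt("P3", "N", "X")
P1NY, P2NY = pkt("P1", "N", "Y"), pkt("P2", "N", "Y")
B1N, B2N = pkt("B1", "N"), pkt("B2", "N")

sent = {
    "M": [P1MX, B1M, P1MY, P2MX],
    "N": [P1NX, P1NY, B1N, P2NY, B2N, P2NX, P3NX],
}
received = {
    "X": [P1MX, P1NX, B1N, B1M, B2N, P2MX, P2NX, P3NX],
    "Y": [P1NY, B1N, B1M, P1MY, P2NY, B2N],
}
print(rule1_ok(sent, received), rule2_ok(received))

# Swapping the two broadcasts at node Y keeps rule 1 but breaks rule 2.
swapped = dict(received, Y=[P1NY, B1M, B1N, P1MY, P2NY, B2N])
print(rule1_ok(sent, swapped), rule2_ok(swapped))
```

The swapped case shows why rule 2 is a separate requirement: each sender's own stream is still in order at node Y, yet the two receivers disagree on the relative order of the broadcasts.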

On the MEMORY CHANNEL network, error han- 
dling is a shared responsibility of the hardware and the 
cluster management software. The hardware provides 
real-time precise error handling and strict packet 
ordering by discarding all packets in a particular path 
that follow an erroneous one. The software is respon- 
sible for recovering the network from the faulty state 
back to its normal state and for retransmitting the lost 
packets. 

Additional MEMORY CHANNEL Network Features 

Three additional features of the MEMORY CHANNEL 
network make it ideal for cluster interconnection:

1. A hardware-based barrier acknowledge that sweeps
the network and all its buffers

2. A fast, hardware-supported lock primitive 

3. Node failure detection and isolation 

Because of the three ordering rules, the MEMORY 
CHANNEL network acknowledge packets are imple- 
mented with little variation over ordinary packets. To 
request acknowledgment of packet reception, a node 
sends an ordinary packet marked with the request- 
acknowledge attribute. The packet is used to sweep 
clean the network queues in the sender-to-destination
path and to ensure that all previously transmitted pack- 
ets have reached the destination. In response to the 
reception of a MEMORY CHANNEL acknowledge 
request, the destination node transmits a MEMORY 
CHANNEL acknowledgment back to the originator. 
The arrival of the acknowledgment at the originating 
node signals that all preceding packets on that path 
have been successfully received. 

MEMORY CHANNEL locks are implemented using 
a lock-acquire software data structure mapped as both 
incoming and outgoing by all nodes in the cluster. 
That is, each node will have a local copy of the page 
kept coherent by the mapping. To acquire a lock, a 
node writes to the shared data structure at an offset 
corresponding to its node identifier. MEMORY 
CHANNEL ordering rules guarantee that the write
order to the data structure (including the update of
the copy local to the node that is setting the lock)
is the same for all nodes. The node can then determine
if it was the only bidder for the lock, in which case 
the node has won the lock. If the node sees multiple 
bidders for the same lock, it resorts to an operating 
system-specific back-off-and-retry algorithm. Thanks 
to the MEMORY CHANNEL guaranteed packet order- 
ing, even under error the above mechanism ensures 
that at most one node in the cluster perceives that 
it was the first to write the lock data structure. To 
guarantee that data structures are never locked indefi- 
nitely by a node that is removed from a cluster, the 
cluster manager software also monitors lock acquisi- 
tion and release. 
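
The bidding step of the lock can be illustrated with a small simulation. The Python sketch below assumes that each node inspects the lock structure at the moment its own write arrives in its local copy; that timing, the function name, and the one-slot-per-node layout are simplifications chosen for the example, and the back-off-and-retry step is left out.

```python
def bid_results(global_write_order, num_nodes):
    """Replay lock bids in the single order that every node observes.

    Because the ordering rules give all nodes the same write order,
    one replay stands in for every node's local copy of the page.
    Returns {node: True if that node saw itself as the only bidder}.
    """
    lock_page = [0] * num_nodes      # one slot per node identifier
    outcome = {}
    for node in global_write_order:
        lock_page[node] = 1          # the bid lands in every copy
        # The bidder checks whether any other slot is already set.
        outcome[node] = sum(lock_page) == 1
    return outcome

# Nodes 2, 0, and 3 bid; the network orders node 2's write first.
result = bid_results([2, 0, 3], num_nodes=4)
print(result)
winners = [node for node, won in result.items() if won]
```

Whatever the interleaving, only the node whose write is ordered first can find the other slots empty, which matches the at-most-one-winner property stated above.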

The MEMORY CHANNEL network supports a 
strong-consistency shared-memory model due to its
strict packet ordering. In addition, the I/O operations 
used to access the MEMORY CHANNEL are fully 
integrated within the node's cache coherency scheme. 
Besides greatly simplifying the programming model, 
such consistency allows for an implementation of 
spinlocks that does not saturate the memory system. 
For instance, while a receiver is polling for a flag 
that signals the arrival of data from the MEMORY 
CHANNEL network, the node processor accesses only 
the locally cached copy of the flag, which will be 
updated whenever the corresponding main memory 
location is written by a MEMORY CHANNEL packet. 

Unlike other networks, the MEMORY CHANNEL 
hardware maintains information on which nodes are 
currently part of the cluster. Through a collection of 
timeouts, the MEMORY CHANNEL hardware con- 
tinuously monitors all nodes in the cluster for illegal 
behavior. When a failure is detected, the node is iso- 
lated from the cluster and recovery software is 
invoked. A MEMORY CHANNEL cluster is equipped 
with software capable of reconfiguration when a node 
is added or removed from the cluster. The node is 
simply brought on-line or off-line, the event is broad- 
cast to all other nodes, and operation continues. To 
provide tolerance to network failures, the cluster can 
be equipped with a pair of topologically identical
MEMORY CHANNEL networks, one for normal operation
and the other for failover. That is, when
a nonrecoverable error is detected on the active
MEMORY CHANNEL network, the software switches
over to the standby network, in a manner transparent
to the application.

The First-generation MEMORY CHANNEL Network 

The first generation of the MEMORY CHANNEL 
network consists of a node interface card and a con- 
centrator or hub. The interface card, called an adapter, 
plugs into the PCI I/O bus. To send a packet, the CPU
writes to the portion of I/O space mapped to the PCI
bus. The store-to-memory is handled by the node's
PCI interface device, which initiates a PCI transfer targeting
the MEMORY CHANNEL adapter transmit
window. When a message is received, the MEMORY
CHANNEL adapter initiates a PCI transfer to write to
the node's CPU memory, targeting the node's PCI
interface, which then accesses the node's main memory.

Besides writing to the node's CPU, an I/O device 
on the PCI bus can transmit directly to a MEMORY 
CHANNEL adapter. This allows, for example, a disk 
controller to transfer data directly from the disk to a 
remote node's memory. The data transfer does not
affect the host system's memory bus. The design 
choice of interfacing MEMORY CHANNEL to the 
PCI bus instead of directly to the node CPU bus is 
not an architectural one, but rather one of practical- 
ity and universality. The PCI is available on most of 
today's systems of varying performance and size and 
is, therefore, an ideal interface point that allows 
hybrid clusters to be built. The obvious disadvan- 
tages of a peripheral interface bus are the additional 
latency incurred because of the extra CPU-to-PCI 
hop and a possible limitation on the available bus 
bandwidth. 

The MEMORY CHANNEL 1 hub is a broadcast- 
only shared bus capable of interconnecting up to 
eight nodes. The MEMORY CHANNEL 1 adapters and
the hub are interconnected in a star topology via 
37-bit-wide (32 bits of data plus sideband signals) 
half-duplex channels. The cables can be up to 4 meters 
long, and the signaling level is 5-volt TTL. A two- 
node cluster can be formed without employing a hub, 
by direct node-to-node interconnection. This config- 
uration is also known as virtual hub configuration. 

The current release of the MEMORY CHANNEL 1 
hardware achieves a sustained point-to-point band- 
width of 66 megabytes per second (MB/s) (from user 
process to user process). Maximum sustained broad- 
cast bandwidth is also 66 MB/s (from a user process 
to many user processes). The cross-section MEMORY 
CHANNEL 1 hub bandwidth is 77 MB/s. Small
message latency is 2.9 microseconds (µs) (from a
sender process STORE instruction to a message
LOAD by a receiver process). The processor overhead
is less than 150 nanoseconds (ns) for a 32-byte packet
(which is also the largest packet size).

As demonstrated in the literature, standard message-passing
application programming interfaces (APIs)
benefit greatly from these MEMORY CHANNEL
communication capabilities. MPI, PVM, and HPF
on MEMORY CHANNEL 1 all have one-way message
latencies of less than 10 µs. These latency numbers
are more than a factor of five lower than those for
traditional MPP architectures (52 to 190 µs).
Communication performance improvements of this
magnitude translate into cluster performance gains
of 25 to 500 percent.

MEMORY CHANNEL 2 Architecture 

Based on the experience with the first-generation 
product, the design goals for MEMORY CHANNEL 2 
were twofold: (1) yield a significant performance 
improvement over MEMORY CHANNEL 1, and (2) 
provide functional enhancements to extend hardware 
support to new operating systems and programming 
paradigms. 

The MEMORY CHANNEL 2 performance/hard- 
ware enhancement goals were 

■ Network bisection bandwidth scalable with the 
number of nodes: 1,000 MB/s for an 8-node cluster
and 2,000 MB/s for a 16-node cluster

■ Improved point-to-point bandwidth, exploiting 
the maximum capability of the 32-bit PCI bus: 
97 MB/s for 32-byte packets and 127 MB/s
for 256-byte packets 

■ Full-duplex channels to allow simultaneous bidirec- 
tional transfers 

■ Maximum copper cable length of 10 meters 
(increased from 4 meters on MEMORY CHANNEL 1)
and fiber support up to 3 kilometers

■ A link layer communication protocol compatible 
with future generations of MEMORY CHANNEL 
hardware and optical fiber interconnections 

■ Enhanced degree of error detection 

The MEMORY CHANNEL 2 functional/software 
enhancement goals were 

■ Software compatible with the first-generation 
MEMORY CHANNEL hardware 

■ Receive-side address remapping and variable page
size to better support new operating systems, such 
as Windows NT, and non-Alpha microprocessors 

■ Remote read capabilities 

■ Global time synchronization mechanism 

■ Conditional write access to support faster recoverable
messaging

These two sets of requirements translate into archi- 
tectural and technological constraints that define the 
MEMORY CHANNEL 2 design space. To increase the 
bisection bandwidth, the hub had to implement an 
architecture that supported concurrent transfers. On 
MEMORY CHANNEL 1, all senders must arbitrate 
for the same hub resource (the bus) on every data 
transfer. Every data transmission occupies the entire 
MEMORY CHANNEL 1 hub for the duration of its
transfer, and all message filtering is performed by the
receivers. Substantial network traffic causes congestion
because all sender nodes fight for the same
resource. This congestion results in a decrease in the
communication speed and thus an increase in the
effective q-ratio as seen by the applications.

On MEMORY CHANNEL 2, the hub has been
designed as an N-by-N nonblocking full-duplex crossbar
with broadcast capabilities, with N = 8 or N = 16.
Such an architecture provides a bisection bandwidth
that scales with the number of nodes and thus remains
matched to the point-to-point bandwidth of the individual
channels while avoiding congestion among
independent communication paths. Therefore, an
increase in network traffic will have little effect on the
effective q-ratio.

The MEMORY CHANNEL ordering rules are easily
met on a crossbar of this type, as follows:

1. The single-sender ordering rule is naturally obeyed
by the fact that the architecture provides a single
path from any source to any destination.

2. The multisender ordering rule is enforced by taking
over all the crossbar routing resources during
broadcast. Although less efficient than broadcast
by packet replication, this technique ensures a strict
common ordering for all destinations.

Finally, crossbar switches are practical to implement
only for a modest number of nodes (8 to 32), but given
the availability of medium-size SMPs, they provide a
satisfactory degree of scaling for the great majority of
practical clustering applications. For instance, cluster
technology can easily provide a 1,000-processor
system simply by connecting 32 nodes, each one a
32-way SMP.

The requirement for a higher point-to-point bandwidth
called for a shift from half-duplex to full-duplex
links. A longer cable length imposed the choice of a
signaling technique other than the TTL employed in
the MEMORY CHANNEL 1 network. The design
team adopted low-voltage differential signaling
(LVDS) as the signaling technique for the second and
future generations of the MEMORY CHANNEL
network on copper. One of the major decisions that
faced the team was whether to maintain the parallel
channel of MEMORY CHANNEL 1 or to adopt a serial
channel to minimize skew problems in transmission
over large communication distances. The bandwidth
demands of future cluster nodes indicated that serial
links would not provide sufficient bandwidth expansion
capabilities at reasonable cost. Thus, the channel
data path width was chosen to be 16 bits, a suitable
compromise that would offer a manageable channel-to-channel
skew while providing the required bandwidth.
Figure 3 illustrates the distinctions between the
first- and second-generation MEMORY CHANNEL
architectures.



MEMORY CHANNEL 2 Link Protocol 

The MEMORY CHANNEL 2 communication protocol
was engineered with the goal of ensuring compatibility
with optical fiber's unidirectional medium. The
interconnection substrate consists of a pair of unidirectional
channels, one incoming and one outgoing.
Each channel consists of a 16-bit data path, a framing
signal, and a clock. The channel carries two types of
packets: data and control. Data packets vary in size and
carry application data. Control packets are used to
exchange flow control, port state, and global clock
information. Control packets take priority over data
packets. They are inserted immediately when a flow
control state change is needed and, otherwise, are
generated at a regular interval (millisecond) to update
less time-critical state. The MEMORY CHANNEL 2
data packet format is shown in Figure 4a. The header
of the data packet contains a packet type (TP), a
destination identifier (DNID), a remote command
(CMD), and a sender identifier (SID). The data payload
starts with the destination address and can vary
in length from 4 to 256 bytes (two to one hundred
twenty-eight 16-bit cycles). It is followed by two
16-bit cycles of Reed-Solomon error detection code.
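
The relation between payload size and link cycles can be worked out in a few lines. The Python sketch below counts only the payload and error-code cycles named above; header cycles are left out because their number is not given here, and the requirement that a payload fill whole 16-bit cycles is an assumption of the example.

```python
LINK_WIDTH_BYTES = 2     # 16-bit data path
ERROR_CODE_CYCLES = 2    # two 16-bit cycles of error detection code

def payload_cycles(payload_bytes):
    """Number of 16-bit link cycles needed to carry the payload."""
    if not 4 <= payload_bytes <= 256:
        raise ValueError("payload must be 4 to 256 bytes")
    if payload_bytes % LINK_WIDTH_BYTES:
        # Assumption: payloads occupy whole 16-bit cycles.
        raise ValueError("payload must be a multiple of 2 bytes")
    return payload_bytes // LINK_WIDTH_BYTES

def payload_share(payload_bytes):
    """Fraction of (payload + error code) cycles that carry payload."""
    cycles = payload_cycles(payload_bytes)
    return cycles / (cycles + ERROR_CODE_CYCLES)

for size in (4, 32, 256):
    print(size, payload_cycles(size), round(payload_share(size), 3))
```

The fixed two-cycle error code weighs most heavily on the smallest packets, which is one reason larger packets approach the link's raw bandwidth more closely.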

The control packet format is shown in Figure 4b. 
The packet is identified by a distinct TP and carries
network and flow control information such as port 
status (PSTAT), configuration (CFG), DNID, hub 
status, and global status. 

Similar to MEMORY CHANNEL 1, MEMORY
CHANNEL 2 uses a clock-forwarding technique in
which the transmit clock is sent along with the data 
and is used at the receiver to recover the data. Data is 
transmitted on both edges of the forwarded clock, and 
a novel dynamic retiming technique is used to syn- 
chronize the incoming packets to the node's local 
clock. The retiming circuit locks onto a good sample 
of the incoming data at the start of every packet and 
ensures accurate synchronization for the packet dura- 
tion, as long as predefined conditions on maximum 
packet size and clock drifts are maintained. 

The MEMORY CHANNEL 2 link protocol has
an embedded autoconfiguration mechanism that is
invoked whenever a node goes on-line. The hub port
and the adapter use this autoconfiguration mechanism
to negotiate the mode of operation (link frequency,
data path width, etc.). The same mechanism allows a
two-node hubless system (a virtual hub configuration)
to consistently assign node identifiers without any
operator intervention or module jumpers.

MEMORY CHANNEL 2 Enhanced Software Support 

MEMORY CHANNEL 2 provides four major additions
to application and operating system support:
(1) receive-side address remapping, (2) remote reads,
(3) a global clock synchronization mechanism, and
(4) conditional writes.



Digital Technical Journal    Vol. 9 No. 1 1997



(a) MEMORY CHANNEL 1 Network

On MEMORY CHANNEL 1 clusters, the network
address is mapped to a local page of physical memory
using remapping resources contained in the system's
PCI-to-host memory bridge. All AlphaServer systems
implement these remapping resources. Other systems,
particularly those with 32-bit addresses, do not
implement this PCI-to-host memory remapping
resource. On MEMORY CHANNEL 2, software has
the option to enable remapping in the receiver side
of the MEMORY CHANNEL 2 adapter on a
per-network-page basis. When configured for remapping,
a section of the PCT is used to store the upper address
bits needed to map any network page to any 32-bit
address on the PCI bus. Such enhanced mapping
capability will also be used to support remote access
to PCI peripherals across the MEMORY CHANNEL
network.
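
Receive-side remapping amounts to replacing the upper bits of the incoming network address with bits stored in the receive PCT. The Python sketch below shows the idea under stated assumptions: the 8-KB page size, the dictionary layout of a PCT entry, and the field names are hypothetical and do not reflect the adapter's actual register format.

```python
PAGE_SHIFT = 13                      # assumed 8-KB network page
PAGE_MASK = (1 << PAGE_SHIFT) - 1

def remap(network_addr, receive_pct):
    """Translate a network address to a 32-bit PCI address."""
    page = network_addr >> PAGE_SHIFT
    entry = receive_pct[page]
    if not entry["receive_enabled"]:
        raise PermissionError("page not enabled for reception")
    base = entry.get("remap_base")   # upper address bits, or None
    if base is None:
        return network_addr          # remapping disabled for this page
    offset = network_addr & PAGE_MASK
    return ((base << PAGE_SHIFT) | offset) & 0xFFFFFFFF

# Hypothetical table: network page 5 is remapped, page 6 is not.
pct = {
    5: {"receive_enabled": True, "remap_base": 0x1234},
    6: {"receive_enabled": True, "remap_base": None},
}
addr = (5 << PAGE_SHIFT) | 0x10
print(hex(remap(addr, pct)))
```

Keeping the translation per page lets software enable remapping only for the network pages that need it, as the text describes.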

A simple remote read primitive was added to 
MEMORY CHANNEL 2 to support research into 
software-assisted shared memory. The primitive 
allows a node to complete a read request to another 
node without software intervention. It is imple- 
mented by a new remote read-on-write attribute in 
the receive page control table. The requesting node 
generates a write with the appropriate remote address 
(a read-request write). When the packet arrives at the 
receiver, its address maps in the PCT to a page marked 
as remote read. After remapping (if enabled), the 
address is converted to a PCI read command. The 
read data is returned as a MEMORY CHANNEL write 
to the same address as the original read-request write. 
Since read access to a page of memory in a remote
node is provided by a unique network address, privileges
to write or read cluster memory remain completely
independent.

A global clock mechanism has been introduced to 
provide support for clusterwide synchronization. 
Global clocks, which are highly accurate, are extremely
useful in many distributed applications, such as parallel 
databases or distributed debugging. The MEMORY 
CHANNEL 2 hub implements this global clock by 
periodically sending synchronization packets to all 
nodes in the cluster. The reception of such a pulse 
can be made to trigger an interrupt or, on future 
MEMORY CHANNEL-to-CPU direct-interface sys- 
tems, may be used to update a local counter. The 
interrupt service software updates the offset between 
the local time and the global time. This synchroniza- 
tion mechanism allows a unique clusterwide time to 
be maintained with an accuracy equal to twice the 
range (max - min) of the MEMORY CHANNEL net- 
work latency, plus the interrupt service routine time. 
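
The accuracy bound stated above is simple arithmetic. The Python sketch below evaluates it and shows the offset update performed by the interrupt service software; the numeric values are made-up placeholders, not measurements, and the function names are illustrative.

```python
def clock_accuracy_us(min_latency_us, max_latency_us, isr_time_us):
    """Accuracy bound: twice the latency range plus the ISR time."""
    return 2 * (max_latency_us - min_latency_us) + isr_time_us

def updated_offset_us(local_time_us, global_time_us):
    """Offset the interrupt service routine keeps between the clocks."""
    return global_time_us - local_time_us

# Hypothetical values: latency between 2.0 and 3.0 us, 10.0-us ISR.
bound = clock_accuracy_us(2.0, 3.0, 10.0)
offset = updated_offset_us(local_time_us=1000.0, global_time_us=1250.0)
print(bound, offset)
```

With these placeholder numbers the interrupt service time dominates the bound, which suggests why a direct MEMORY CHANNEL-to-CPU counter update would tighten it.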

Conditional write transactions have been intro- 
duced in MEMORY CHANNEL 2 to improve the speed 
of a recoverable messaging system. On MEMORY
CHANNEL 1, the simplest implementation of general-purpose
recoverable messaging requires a round-trip
acknowledge delay to validate the message transfer,
which adds to the communication latency. The
newly introduced MEMORY CHANNEL 2 conditional
write transaction provides a more efficient
implementation that requires a single acknowledge
packet, thus practically reducing the associated latency
by more than a factor of two.

MEMORY CHANNEL 2 Hardware

As suggested in the previous architectural description, 
MEMORY CHANNEL 2 hardware components are
similar to those in MEMORY CHANNEL 1, namely 
a PCI adapter card (one per node), a cable, and a 
central hub. 

The MEMORY CHANNEL 2 PCI Adapter Card  The PCI
adapter card is the hardware interface of a node to the
MEMORY CHANNEL network. A block diagram of
the adapter is shown in Figure 5. The adapter card is
functionally partitioned into two subsystems: the PCI
interface and the link interface. First in, first out (FIFO)
queues are placed between the two subsystems. The
PCI interface communicates with the host system,
feeds the link interface with data packets to be sent, and
forwards received packets on to the PCI bus. The link
interface manages the link protocol and data flow: It
formats data packets, generates control packets, and
handles error code generation and detection. It also
multiplexes the data path from the PCI format (32 bits
at 33 megahertz [MHz]) to the link protocol (16 bits
at 66 MHz). In addition, the link interface implements
the conversion to and from LVDS signaling.
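
The width conversion preserves raw bandwidth, as a quick calculation shows. The Python sketch below uses only the widths and clock rates quoted above; the helper names and the low-half-first ordering of the two 16-bit cycles are assumptions made for illustration.

```python
def raw_bandwidth_mb_s(width_bits, clock_mhz):
    """Peak transfer rate in MB/s for one transfer per clock."""
    return width_bits / 8 * clock_mhz

def split_word(word32):
    """Split a 32-bit PCI word into two 16-bit link cycles.
    The low half is sent first here; the real order is not stated."""
    return word32 & 0xFFFF, (word32 >> 16) & 0xFFFF

def join_word(low16, high16):
    """Rebuild the 32-bit word on the receive side."""
    return (high16 << 16) | low16

pci_rate = raw_bandwidth_mb_s(32, 33)    # PCI side
link_rate = raw_bandwidth_mb_s(16, 66)   # link side
print(pci_rate, link_rate)

word = 0xDEADBEEF
halves = split_word(word)
print([hex(h) for h in halves], hex(join_word(*halves)))
```

Halving the width while doubling the rate leaves the peak figure unchanged, so the link is matched to the 32-bit PCI bus rather than limiting it.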

The transmit (TX) and receive (RX) data paths, 
both heavily pipelined, are kept completely separate 
from each other, and there is no resource conflict 
other than the PCI bus access. A special case occurs 
when a packet is received with the acknowledge 
request bit or the loopback bit set: the paths in both 
directions are coordinated to transmit back the
response packet while still receiving the original one 
(employing the gray path in Figure 5). During a nor- 
mal MEMORY CHANNEL 2 transaction, the transmit 
pipeline processes a transmit request from the PCI 
bus. The transmit PCT is addressed with a subset of 
the PCI address bits and is used to determine the 
intended destination of the packet and its attributes. 
The transmit pipeline feeds the link interface with data 
packets and appropriate commands through the trans- 
mit FIFO queue. The link interface formats the pack- 
ets and sends them on the link cable. At the receiver, 
the link interface disassembles the packet in an inter- 
mediate format and stores it into the receive FIFO 
queue. The PCI interface performs a lookup in the 
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receiver PCT to ensure that rlie page has been enabled 
for reception and to determine the local destination 
address. 

In the simplest implementation, packets are subject 
to two store -and-fbrward delays — one on the transmit 
path and one on the receive path. Because of the 
atomicity of packets, the transmit path must wait for 
the last data word to be correctly taken in from the 
PCI bus before forwarding the packet to the link inter- 
face. The receive path experiences a delay because the 
error detection protocol requires the checking of the 
last cycle before the packet can be declared error-free. 
A set of control/status MEMORY CHANNEL 2 registers, 
addressable through the PCI, is used to set various 
modes of operation and to read local status of the 
link and global cluster status. 

The MEMORY CHANNEL 2 Hub The hub is the central 
resource that interconnects all nodes to form 
a cluster. Figure 6 is a block diagram of an 8-by-8 
MEMORY CHANNEL 2 hub. The hub implements 
a nonblocking 8-by-8 crossbar and interfaces to eight 
16-bit-wide full-duplex links by means of a link 
interface similar to that used in the adapter. The actual 
crossbar has eight input ports and eight output ports, 
all 16 bits wide. Each output port has an 8-to-1 
multiplexer, which is able to choose from one of eight 
input ports. Each multiplexer is controlled by a local 
arbiter, which is fed decoded destination requests from 
the eight input ports. The port arbitration is based on a 
fixed-priority, request-sampling algorithm. All requests 
that arrive within a sampling interval are considered of 
equal age and are serviced before any new requests. 
This algorithm, while not enforcing absolute 
arrival-time ordering among packets sent from different 
nodes, assures no starvation and a fair age-driven 
priority across sampling intervals. 
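
The request-sampling idea can be illustrated with a small software model of one output port's arbiter. This is a minimal sketch, not the hub's logic: the class name, the use of the lowest port number as the highest fixed priority, and the rule that a new sample is taken only when the previous batch is empty are all assumptions made for illustration.

```python
from collections import deque


class SamplingArbiter:
    """Toy model of one output port's arbiter (illustrative assumptions only)."""

    def __init__(self, num_inputs=8):
        self.num_inputs = num_inputs
        self.pending = set()   # requests that arrived since the last sample
        self.batch = deque()   # sampled requests, in fixed-priority order

    def request(self, input_port):
        if not 0 <= input_port < self.num_inputs:
            raise ValueError("no such input port")
        self.pending.add(input_port)

    def grant(self):
        """Return the next input port to service, or None when idle."""
        if not self.batch:
            # Take a new sample: everything pending is treated as equal age
            # and ordered by fixed priority (assumed: lowest port first).
            self.batch = deque(sorted(self.pending))
            self.pending.clear()
        return self.batch.popleft() if self.batch else None
```

In this model, if ports 5 and 2 request together and port 0 requests afterward, the grants come out as 2, 5, 0: the late high-priority request cannot overtake the already sampled batch, which is why no port is starved.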

When a broadcast request arrives at the hub, the 
otherwise independent arbiters synchronize them- 
selves to transfer the broadcast packet. The arbiters 
wait for the completion of the packet currently being 
transferred, disable point-to-point arbitration, signal 
that they are ready for broadcast, and then wait for all 
other ports to arrive at the same synchronization 
point. Once all output ports are ready for broadcast, 
port • proceeds to read from the appropriate input 
port, and all other ports (including port 0) select the 
same input source. The maximum synchronization 
wait time, assuming no output queue blocking, is equal 
to the time it takes to transfer the largest size packets 
(256 bytes), about 4 µs, and is independent of the 
number of ports. As in any crossbar architecture with 
a single point of coherency, such broadcast operation 
is more costly than a point-to-point transfer. Our 
experience has been that some critical but relatively 
low-frequency operations (primarily fast locks) exploit 
the broadcast circuit. 

MEMORY CHANNEL 2 Design Process and Physical 
Implementation 

Figure 7 illustrates the main MEMORY CHANNEL 
physical components. As shown in Figure 7a, two-node 
clusters can be constructed by directly connecting two 
MEMORY CHANNEL PCI adapters and a cable. This 
configuration is called the virtual hub configuration. 
Figure 7b shows clusters interconnected by means of 
a hub. 

The MEMORY CHANNEL adapter is implemented 
as a single PCI card. The hub consists of a motherboard 
that holds the switch and a set of linecards, one 
per port, that provides the interface to the link cable. 

The adapter and hub implementations use a combination 
of programmable logic devices and off-the-shelf 
components. This design was preferred to an 
application-specific integrated circuit (ASIC) 
implementation because of the short time-to-market 
requirements. In addition, some of the new functionality 
will evolve as software is modified to take advantage 
of the new features. The MEMORY CHANNEL 2 
design was developed entirely in Verilog at the 
register transfer level (RTL). It was simulated using the 
Viewlogic VCS event-driven simulator and synthesized 
with the Synopsys tool. The resulting netlist 
was fed through the appropriate vendor tools for 
placing and routing to the specific devices. Once the 
device was routed, the vendor tools provided a 
gate-level Verilog netlist with timing information, which 
was then simulated to verify the correctness of the 
synthesized design. Boardwide static timing analysis 
was run using the Viewlogic MOTIVE tool. The link 
interface was fitted to a single Lucent Technologies 
Optimized Reconfigurable Cell Array (ORCA) Series 
field-programmable gate array (FPGA) device. The 
PCI interface was implemented with one ORCA 
FPGA device and several high-speed AMD programmable 
array logic devices (PALs). Thanks to the in-system 
programmability of PALs and FPGAs, the 
MEMORY CHANNEL 2 adapter board is designed 
to be completely reprogrammable in the field from 
the system console through the PCI interface. 

Figure 7 

(a) Virtual hub mode: direct node-to-node interconnection of two PCI adapter cards 
(b) Using the MEMORY CHANNEL hub to create clusters of up to 16 nodes 

MEMORY CHANNEL 2 Performance 

This section presents MEMORY CHANNEL 2 perfor- 
mance data configured in virtual hub mode (direct 
node-to-node connection). Wherever possible, actual 
measured results are presented. A two-node 
AlphaServer 4100 5/300 cluster was used for all hard- 
ware measurements. 

Network Throughput 

The MEMORY CHANNEL 2 network has a raw data 
rate of 2 bytes every 15 ns, or 133.3 MB/s. Messages are 
packetized by the interface into one or more MEMORY 
CHANNEL packets. Packets with data payloads of 4 to 
256 bytes are supported. Figure 8 compares, for various 
packet sizes, the maximum bandwidth the MEMORY 
CHANNEL 2 network is capable of sustaining with the 
effective process-to-process bandwidth achieved using a 
pair of AlphaServer 4100 systems. With 256-byte 
packets, MEMORY CHANNEL 2 achieves 127 MB/s or 
about 96 percent of the raw wire bandwidth. 

Figure 8 

MEMORY CHANNEL 2 Point-to-point Bandwidth 
as a Function of Packet Size, Comparing Network 
Theoretical Limit and Sustained Process-to-process 
Measured Performance 

For PCI writes of less than or equal to 256 bytes, the 
MEMORY CHANNEL 2 interface simply converts the 
PCI write to a similar-size MEMORY CHANNEL 
packet. The current design does not aggregate multiple 
PCI write transactions into a single MEMORY 
CHANNEL packet and automatically breaks PCI writes 
larger than 256 bytes into a sequence of 256-byte 
packets. 
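
The packetization rule can be written down in a few lines. The following is a minimal sketch; the function name is hypothetical, and the handling of a write whose length is not a multiple of 256 bytes (a shorter final packet) is an assumption, since the text describes only the 256-byte pieces. The 4-byte minimum payload is not enforced here.

```python
MAX_PAYLOAD = 256  # largest packet payload in bytes, as stated in the text


def packetize(write_len):
    """Return the payload sizes of the packets produced by one PCI write."""
    if write_len <= 0:
        raise ValueError("write length must be positive")
    full, rest = divmod(write_len, MAX_PAYLOAD)
    sizes = [MAX_PAYLOAD] * full
    if rest:
        # Assumption: any remainder travels as one shorter final packet.
        sizes.append(rest)
    return sizes
```

For example, a 32-byte write maps to a single 32-byte packet, while a 600-byte write becomes packets of 256, 256, and 88 bytes under the stated assumption. Writes are never merged, so two back-to-back 32-byte writes still produce two packets.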

As Figure 8 shows, the bandwidth capability of the 
MEMORY CHANNEL 2 network exceeds the sustainable 
data rate of the AlphaServer 4100 system. The 
AlphaServer system is capable of generating 32-byte 
packets to the MEMORY CHANNEL 2 interface at 
88 MB/s or about 10 percent less than the maximum 
network bandwidth at a 32-byte packet size. This 
represents a 33 percent bandwidth improvement over the 
previous-generation MEMORY CHANNEL, whose 
effective bandwidth was 66 MB/s. An ideal PCI host 
interface would achieve the full 97 MB/s, but the 
current AlphaServer 4100 design inserts an extra PCI 
stall cycle on sustained 32-byte writes to the PCI. The 
32-byte packet size is a limitation of the Alpha 21164 
microprocessor; future versions of the Alpha 
microprocessor will be able to generate larger writes to the 
PCI bus. 
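
The quoted bandwidth figures can be cross-checked with a simple model in which each packet carries a fixed number of overhead bytes. This is a sketch under a stated assumption: the 12-byte overhead is not given in the text; it is chosen here only because it reproduces both quoted network limits (97 MB/s at 32 bytes and 127 MB/s at 256 bytes).

```python
# Raw link rate: 2 bytes every 15 ns, expressed in MB/s (10**6 bytes/s).
RAW_MB_S = 2 / 15e-9 / 1e6

# Assumption, inferred from the quoted limits, not stated in the text.
ASSUMED_OVERHEAD_BYTES = 12


def link_limit_mb_s(payload_bytes, overhead_bytes=ASSUMED_OVERHEAD_BYTES):
    """Payload bandwidth when every packet carries fixed overhead bytes."""
    return RAW_MB_S * payload_bytes / (payload_bytes + overhead_bytes)


print(round(RAW_MB_S, 1))             # raw wire rate
print(round(link_limit_mb_s(32)))     # network limit for 32-byte packets
print(round(link_limit_mb_s(256)))    # network limit for 256-byte packets
print(round(88 / 66 - 1, 2))          # gain over the previous generation
```

The last line confirms the stated improvement: 88 MB/s against 66 MB/s is a gain of one third.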

Latency 

Figure 9 shows the latency contributions along a 
point-to-point path from a sending process on node 
1 to a receiving process on node 2. Using a simple 
8-byte ping-pong test, we determined that the 
one-way latency of this path is 2.17 µs. In the test, a user 
process on node 1 sends an 8-byte message to node 2. 
Node 2 is polling its memory waiting for the message. 
After node 2 sees the message, it sends a similar 
message back to node 1. (Node 1 started polling its 
memory after it sent the previous message.) One-way 
latency is calculated by dividing by two the time it takes 
to complete a ping-pong exchange. Approximately 
330 ns elapse from the time a sending processor issues 
a store instruction until the store propagates to the 
sender's PCI bus. The latency from the sender's PCI to 
the receiver's PCI over the MEMORY CHANNEL 2 
network is about 1.1 µs. Writing the main memory on 
the receiver node takes an additional 330 ns. Finally, 
the poll loop takes an average of about 400 ns to read 
the flag value from memory. 
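
The four contributions can be added up as a quick consistency check against the measured figure. The sketch below uses only the approximate values quoted in the paragraph; the dictionary keys and the helper name are descriptive labels introduced here.

```python
# Approximate latency contributions, in nanoseconds, as quoted in the text.
LATENCY_NS = {
    "store reaches the sender's PCI bus": 330,
    "sender's PCI to receiver's PCI over the network": 1100,
    "write to main memory on the receiver": 330,
    "poll loop reads the flag": 400,
}

total_ns = sum(LATENCY_NS.values())


def one_way_latency_us(ping_pong_us):
    """One-way latency is half the time of a full ping-pong exchange."""
    return ping_pong_us / 2


print(total_ns / 1000)  # sum of the contributions, in microseconds
```

The contributions sum to 2.16 µs, which agrees with the measured 2.17 µs to within the rounding of the individual terms.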

Table 2 shows the process-to-process one-way 
message latency for different types of communications 
at a fixed 8-byte message size. The first row contains 
the result of the ping-pong experiment previously 
described. For comparison, the previous generation 
of MEMORY CHANNEL had a ping-pong latency of 
2.60 µs. The second row represents the latency for the 
simplest implementation of variable-length messaging. 
The latencies of standard communication interfaces are 
shown in the last two rows, namely, High Performance 
Fortran and Message Passing Interface. The results 
shown in this table are only between two and three 
times slower than the latencies measured for the same 
communication interfaces over the SMP bus of the 
AlphaServer 4100 system. 

Figure 9 

Latency Contributions along the Path from a Sender to a Receiver 

Table 2 

MEMORY CHANNEL 2 One-way Message Latency 
in Virtual Hub Mode for Different Communication 
Interfaces 

                                   One-way Message Latency 
Communication Type                 (Microseconds) 

Ping-pong 8-byte message           2.17 
8-byte message plus 8-byte flag    2.60 
HPF 8-byte message                 5.10 
MPI 8-byte message                 6.40 



The latency of the MEMORY CHANNEL 2 network 
increases with the size of the message because of the 
presence of store-and-forward delays in the path. As 
discussed in the previous hardware description, all 
packets are subject to two store-and-forward delays, 
one on the outgoing buffer and one on the incoming 
buffer (required for error checking). These delays also 
play a role in the effective bandwidth of a stream of 
packets. On the one hand, smaller packets are less 
efficient than larger ones in terms of overhead. On the 
other hand, smaller packets incur a shorter 
store-and-forward delay per packet, which can then be 
overlapped with the transfer of previous packets on the 
link, making the overall transfer more efficient. The 
hub performs cut-through packet routing with an 
additional delay of about 0.5 µs. 

Summary and Future Work 

This paper presents an overview of the 
second-generation MEMORY CHANNEL network, MEMORY 
CHANNEL 2. The rationale behind the major design 
decisions is discussed in light of the experience 
gained from MEMORY CHANNEL 1. A description 
of the MEMORY CHANNEL 2 hardware components 
led to the presentation of measured performance results. 
Compared to other more traditional interconnection 
networks, MEMORY CHANNEL 1 provides unparalleled 
performance in terms of latency and bandwidth. 
MEMORY CHANNEL 2 further enhances performance 
by providing point-to-point bandwidth of 97 
MB/s for 32-byte packets, an 
application-to-application latency of less than 2.2 microseconds, 
and a cross-section bandwidth of 1,000 MB/s for 8 
nodes and 2,000 MB/s for 16 nodes. It also provides 
enhanced software support to improve the performance 
of the most common operations in a cluster 
environment, e.g., global synchronization, and reduces the 
complexity of the software layer by providing a more 
flexible address mapping. In addition, the MEMORY 
CHANNEL 2 network has been designed to be both 
hardware and software compatible with future 
generations on either copper or fiber-optic communication up 
to a distance of 3 kilometers. Future generations of the 
MEMORY CHANNEL architecture will benefit from the 
MEMORY CHANNEL 2 experience and will continue 
to provide enhancements to communication 
performance and to further refine those mechanisms 
introduced to support parallel cluster software. 
Integrating ObjectBroker 
and DCE Security 



John H. Parodi 
Fred W. Burgher 



The integration of the ObjectBroker software 
product with the Distributed Computing 
Environment (DCE) Security Service makes 
ObjectBroker the most secure object request 
broker (ORB) in the industry. ObjectBroker and 
DCE Security together allow client-to-server, 
server-to-client, and mutual authentication. 
The integrated software provides these security 
functions, as well as message integrity protec- 
tion, transparently to the applications. Integra- 
tion has been accomplished in a way that allows 
plug-in replacement of the ObjectBroker security 
subsystem by DCE Security, Kerberos, or any third- 
party software security product that supports 
the DCE's Generic Security Service Application 
Programming Interface (GSS-API). This approach 
supports future GSS-API-compliant third-party 
security products based on Kerberos and also prod- 
ucts that may address other security technologies 
such as biometrics and smart cards. In addition, 
the approach places responsibility for compliance 
with International Traffic in Arms Regulations in 
the hands of the purveyors and owners of GSS 
libraries rather than with the ORB vendor. Note 
that the ObjectBroker product is middleware 
jointly developed and distributed by DIGITAL and 
BEA Systems, who have formed a worldwide tech- 
nology and distribution partnership. 



An object request broker (ORB) is a distributed 
software layer that translates abstract service requests 
from a client application into requests for specific 
servers, regardless of where those servers actually 
reside on the network.1 In this way, ORBs provide 
a middle tier in multitiered client-server systems. The 
ObjectBroker software, developed and distributed 
by strategic partners DIGITAL and BEA Systems, is 
an implementation of the Common Object Request 
Broker Architecture (CORBA) specified by the Object 
Management Group (OMG).2 

Security is a growing concern for those who manage 
distributed computing systems, and the security options 
available to the CORBA community have been quite 
limited until recently. In the past year, OMG has 
adopted a specification for a CORBA Security Service, 
although few commercially available implementations 
exist at the time of this writing. 

Outside the CORBA community, one widely accepted 
standard for security in distributed, heterogeneous 
systems is the Generic Security Service Application 
Programming Interface (GSS-API),3,4 as specified by 
The Open Group (which was formed by the merger 
of the Open Software Foundation and X/Open 
Company Ltd.). 5 The GSS-API provides the ability for 
software entities in a distributed application to authen- 
ticate one another and to protect ongoing communi- 
cation between them. The Distributed Computing 
Environment (DCE) Security Service provides an 
implementation of the GSS-API as one way to access 
its security services. 

Plans are under way to implement the CORBA 
Security Service in the ObjectBroker software, but 
the implementation specifications were not complete 
when ObjectBroker version 2.6 was designed. At 
present, by integrating support for GSS-API 
implementations, the ObjectBroker software provides its 
customers state-of-the-art distributed system security 
with the widest choice of security technologies and 
products. The first commercially available GSS-API 
implementation was the Kerberos-based DCE Security 
Service itself, but other implementations, which use 
a variety of security technologies and are produced by 
various independent software vendors, are expected to 
follow soon. 
Security 

Ensuring secure communication among entities in a 
distributed computer system is a challenging task. The 
term security normally includes three broad classes 
of system requirements:6 

1. Secrecy/privacy: the ability to protect information 
from unauthorized access 

2. Integrity: the ability to protect information from 
unauthorized alteration or destruction 

3. Availability: the ability to ensure that valid access to 
information can be accomplished in a timely manner 

Enforcement of a security policy is accomplished by 
way of the following security functions: 

■ Authentication: the verification of the identity of a 
security principal 

■ Authorization: the determination of which 
principals can perform which actions 

■ Access control: the enforcement of the security 
policy, based on authentication and authorization 
information, to determine whether to allow or 
disallow a particular action 

The Distributed Computing Environment 

The Open Group's Distributed Computing Environment 
is an integrated, standard set of technologies, 
tools, and services that enables the development and 
deployment of distributed applications in a 
heterogeneous, multivendor computing environment.7 
Typically, system vendors implement the DCE on their own 
platforms. The DCE has been endorsed by virtually all 
system vendors, including IBM, HP, DIGITAL, NCR, 
Stratus, Cray, HAL, Hitachi, Siemens Nixdorf, NEC, 
Data General, Bull, Tandem, Transarc, SCO, Gradient, 
Siemens Pyramid, and Olivetti. 

The DCE provides the following six technology 
components: 

1. Remote Procedure Call (RPC), which facilitates 
distributed communication 

2. Directory Service, which provides a single naming 
model throughout the distributed environment 

3. Security Service, which provides reliable authenti- 
cation, authorization, and data protection 

4. Distributed Time Service, which synchronizes the 
network system clocks 

5. Distributed File Service, which provides access to 
networkwide files 

6. Threads Service (The DCE uses POSIX threads 
where available; on operating systems where POSIX 
is not available, the DCE supplies a threads package 
that provides the same interface as POSIX threads.) 



DCE users can be characterized by their need to 
deploy and/or integrate large-scale applications on 
multiple heterogeneous platforms. The most common 
reasons given for choosing the DCE are its security 
features, its scalability, and its robustness. 

DCE Security provides the following services: 

■ The DCE Authentication Service allows users and 
resources to prove their identity to each other. This 
service is currently based on Kerberos, which requires 
that all users and resources possess a secret key. 

■ The DCE Authorization Service verifies operations 
that users may perform on resources. A DCE Registry 
contains a list of valid users. An access control list asso- 
ciated with each resource determines valid users and 
the types of operations a user may perform. 

■ The DCE Data Integrity Service protects network 
data from tampering. Automatically generated 
cryptographic checksums are appended to network 
transmissions, allowing the DCE to determine if 
data has been corrupted in transit. The encrypted 
checksum is a message authentication code (MAC) 
based on the Data Encryption Standard (DES). 

ObjectBroker uses the DCE Authentication and Data 
Integrity services. 

ObjectBroker Security 

Although DCE Security provides three basic levels 
of protection (None, Data Integrity, and Privacy), 
ObjectBroker uses only the Data Integrity level. 
This level provides a mechanism that computes an 
encrypted, time-stamped checksum and attaches it 
to the message so that any attempt to change or 
replay the information can be detected. In addition, 
ObjectBroker uses explicit calls to the DCE Security 
library's GSS-API to accomplish authentication but 
maintains its own access control lists and authorization 
database and mediates access control itself.8 
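
The idea of a keyed, time-stamped checksum can be shown with standard-library primitives. This is a minimal sketch and not the DCE mechanism: it substitutes HMAC-SHA256 for the DES-based MAC named in the text, and the function names, the 8-byte time-stamp layout, and the 300-second acceptance window are assumptions made for illustration.

```python
import hashlib
import hmac
import struct
import time

TAG_LEN = 32  # length of an HMAC-SHA256 tag in bytes


def seal(key, message, now=None):
    """Prefix a time stamp and append a keyed checksum over both."""
    stamp = int(time.time() if now is None else now)
    stamped = struct.pack(">Q", stamp) + message
    return stamped + hmac.new(key, stamped, hashlib.sha256).digest()


def unseal(key, sealed, max_age_s=300, now=None):
    """Verify the checksum and the age of the message, then return it."""
    stamped, tag = sealed[:-TAG_LEN], sealed[-TAG_LEN:]
    expected = hmac.new(key, stamped, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("message was altered in transit")
    (stamp,) = struct.unpack(">Q", stamped[:8])
    current = time.time() if now is None else now
    if current - stamp > max_age_s:
        raise ValueError("message is too old (possible replay)")
    return stamped[8:]
```

Because the time stamp is covered by the checksum, an attacker cannot refresh an old message without the key. A time window alone only bounds replay; a production mechanism would typically also track sequence numbers.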

Note that within a DCE cell, it is possible to use the 
DCE RPC with the DCE Security Service to protect 
communication at the wire protocol level. However, 
because ObjectBroker does not use the DCE RPC 
wire protocol, its use of the DCE Security Service 
is accomplished by means of explicit calls by 
ObjectBroker to the GSS-API implementation. 

ObjectBroker's use of the DCE Security Service 
provides data integrity protection, authentication of 
clients to servers and servers to clients, and protection 
against replay and sequencing attacks. Although 
encryption is used to create the digital signatures 
that provide these protections at the network Data 
Integrity level, ObjectBroker does not directly support 
the capability to encrypt data, even on nodes that 
have Privacy-level DCE Security Service support. 
ObjectBroker provides no protection from denial of 
service attacks either. 
Of course, a customer's use of DCE Security is 
entirely optional, and the security mechanism used in 
previous versions of the ObjectBroker software is still 
supported. With this mechanism, called trusted secu- 
rity, the node/username associated with a request 
from a remote node is accepted to be as claimed. For 
trusted security, ObjectBroker uses a proxy approach 
in which the node/username associated with a remote 
request is mapped to a proxy identity on the server's 
system. An access control decision is thus based on 
the authorization information for the proxy identity. 
The proxy approach to the trusted security mechanism 
was necessary because there was no concept of global 
identity for a user, that is, an identity known to all 
computer nodes in a distributed system. 

To implement DCE Security on a particular platform, 
a Security Integration Architecture accomplishes 
the mapping of a globally understood username (e.g., a 
user or a security principal defined within a DCE cell or 
a Kerberos realm) to a login of a local user on a particu- 
lar system. Some implementations of DCE Security and 
some systems (for example, the OpenVMS operating 
system) use the notion of integrated or global login, in 
which a local user login also causes a global user login 
to be performed. For the OpenVMS system, the global 
realm is the cluster. For the implementation of DCE 
Security on the DIGITAL UNIX system, the global 
realm is the DCE cell. 

Because an ObjectBroker configuration can include 
platforms that have no implementation of the DCE, 
and because the Security Integration Architecture is 
different on every DCE platform, there was no common 
mechanism for ObjectBroker to use to implement 
an integrated global login across all supported 
platforms. Thus, ObjectBroker is limited by the 
integrated login capabilities available on other platforms' 
implementations of the DCE. 

For this reason, ObjectBroker retains a proxy mechanism, 
even for use by nodes that support the DCE. 
For authentication between such nodes, a generic 
remote host definition (called SecGlobalName) is 
mapped to a local user on the local system. Should a 
server receive a request that requires authentication 
from a client node, the server uses SecGlobalName to 
attempt to match the corresponding global principal 
name to a local user name. 

In other words, because there is no common global 
identity mechanism, ObjectBroker's proxy implementation 
maps either a trusted remote user or a global 
user identity to a local system identity to accomplish 
a generic mapping between global and local users. 
Rather than map multiple host/username pairs to the 
local proxy, the ObjectBroker software maps a single 
SecGlobalName, known to all nodes in the DCE cell, 
to that proxy whenever possible. 
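
The two kinds of proxy mapping can be pictured as two lookup tables. The sketch below is illustrative only: the class, its methods, and the sample names are hypothetical, and the article does not describe ObjectBroker's actual data structures or lookup order.

```python
class ProxyMap:
    """Maps remote identities to a local proxy user (illustrative only)."""

    def __init__(self):
        self._by_global_name = {}  # global principal name -> local user
        self._by_host_user = {}    # (node, username) -> local user

    def map_global(self, global_name, local_user):
        self._by_global_name[global_name] = local_user

    def map_trusted(self, node, username, local_user):
        self._by_host_user[(node, username)] = local_user

    def resolve(self, node, username, global_name=None):
        """Return the local proxy user, or None if nothing matches."""
        # Assumption: a mapped global name takes precedence when present.
        if global_name is not None and global_name in self._by_global_name:
            return self._by_global_name[global_name]
        return self._by_host_user.get((node, username))
```

A single `map_global` entry replaces one `map_trusted` entry per host/username pair, which is the economy the paragraph describes.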



Mechanism for Global Authentication 

The DCE Security Service provides the mechanism 
for global identity. The mechanism is based on 
Kerberos encryption, which is a private or symmetric 
key scheme (as opposed to a public or asymmetric key 
scheme). A private key scheme requires some trusted 
third-party node to act as a distribution center for 
encryption keys or credentials. Each node or user has a 
key that is known only to the user and the distribution 
center. In DCE Security, the distribution center is 
known as a privilege server.9 

The following is a simplified description of the 
encryption key protocol between the privilege server 
and a client. The actual key exchange protocol, which 
uses three exchanges and conversion keys, results in a 
Privileged Access Certificate (PAC) in the possession 
of a client. The PAC, which is appended to each request, 
contains the authorization information to be compared 
with the access control information stored with 
the application server. 

When a client wishes to communicate with a server, 
each must acquire a time-stamped session key for 
secure communication. The session key is protected in 
several ways. The time stamp means that the key is 
only valid for a limited time (the amount of time is 
configurable), which protects against brute-force 
attempts to break the key and reuse it. Also, each key is 
host-specific and can only be used from the node for 
which it is issued. Finally, the session key is never sent 
over the network in unencrypted form. 

For a user to initiate a DCE_login, the client must 
enter its DCE_login password. To register as an initiator 
and acceptor of security contexts, a server uses a 
SERVTAB key file. This file contains an encrypted key 
that permits the server to obtain a set of credentials 
similar to those given to a user. These credentials allow 
the server to accept security contexts from clients or to 
initiate requests (that is, become a client) to other 
servers. The reason for having servers acquire creden- 
tials through the SERVTAB mechanism is that servers 
may be started on demand by the ObjectBroker Agent 
(the component that locates the appropriate server 
to satisfy a client request) or by system administrators 
who do not want to be burdened by having to know 
a server password. 

In either case, the client or the server specifies the 
principal name to be authenticated. The node sends 
the specified principal's name to the privilege server. 
The privilege server returns a session key that is 
encrypted using the principal's password or SERVTAB 
key. The DCE run-time software running on the local 
system decrypts the session key using the password or 
SERVTAB key. Once the client and the server have 
decrypted session keys, they can use the keys to initiate 
secure communication with each other. 
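
The simplified exchange described above can be mimicked in a few lines. This is a toy sketch and not the DCE protocol: the XOR keystream stands in for real encryption and is not secure, the class and function names are hypothetical, and the `issued` dictionary exists only so that the example can be checked.

```python
import hashlib
import secrets


def _keystream(key, length):
    """Derive a pseudo-random byte stream from a key (toy construction)."""
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]


def toy_cipher(key, data):
    """XOR with the keystream; applying it twice restores the data."""
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))


def long_term_key(secret):
    """Stand-in for a key derived from a password or SERVTAB entry."""
    return hashlib.sha256(secret.encode()).digest()


class PrivilegeServer:
    """Knows every principal's long-term key and hands out session keys."""

    def __init__(self):
        self._keys = {}
        self.issued = {}  # kept only so the example can be verified

    def register(self, principal, secret):
        self._keys[principal] = long_term_key(secret)

    def issue(self, principal):
        session_key = secrets.token_bytes(16)
        self.issued[principal] = session_key
        # The session key never leaves the server unencrypted.
        return toy_cipher(self._keys[principal], session_key)
```

A principal that knows its own secret recovers the session key with `toy_cipher(long_term_key(secret), blob)`; anyone else obtains unrelated bytes.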
Thus, if a server is configured to require authentication, 
then before invoking a method on that server, 
a client must successfully perform a DCE_login and 
obtain the credentials needed to establish a security 
context with that server. A client may also require 
authentication from the server to ensure that some 
malicious software is not masquerading as a real server. 

Note that the operations for acquiring credentials 
are accomplished outside the server executable. The 
operations are performed by the ObjectBroker run-time 
software, based on configuration settings in the 
ObjectBroker Security Registry. The goal is to avoid 
burdening applications with the knowledge of security 
mechanisms. 

Authentication requirements can apply to the 
ObjectBroker Agent as well as to clients and servers. 
The Agent is in fact a separate security principal, 
and one can require client-to-Agent, Agent-to-client, 
Agent-to-server, and server-to-Agent authentication 
in an ObjectBroker configuration — in addition to 
authentication between the client and the server. The 
client or the server can independently set these modes, 
or the ObjectBroker system can require that modes 
be set nodewide. 

Security Design Issues for ObjectBroker 

The security issues associated with the design of 
ObjectBroker versions 2.6, 2.7, and 3.0 were primar- 
ily those of increasing the security capabilities and 
preserving upward compatibility with previous 
ObjectBroker versions. While compatibility is always 
a concern when upgrading software, ObjectBroker's 
requirements in this area are particularly stringent 
because customers have mission-critical applications 
running in very large configurations. In some cases, it 
is difficult or impossible to upgrade all ObjectBroker 
nodes at one time, so it must be possible to do a 
rolling upgrade that minimizes the disturbance to the 
configuration and allows uninterrupted operation 
of applications. 

The need for dynamic, plug-in replaceability of 
the security subsystem was an important issue for two 
reasons. First, to provide standards-based solutions to 
computing problems, the ObjectBroker design had to 
allow the integration of any security product that 
implements the GSS-API. The second reason has to do 
with export controls. 

United States government export regulations specify 
that hardware, software, and documentation for cryp- 
tographic products may be exported by license only. 
Specifically, the Department of State's International 
Traffic in Arms Regulations (22 Code of Federal 
Regulations Subchapter M) require that an export 
license be obtained from the department before any 
cryptographic hardware, software, or documentation is 
exported from the United States. An ObjectBroker 
design goal was not to encumber the product with 
export restrictions. Therefore, ObjectBroker itself does 
not include any cryptographic security mechanism. An 
ObjectBroker customer must provide an appropriate 
GSS library; whatever package is available on the system 
is the one ObjectBroker will use. 
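
Using whatever GSS library the system offers amounts to trying candidate libraries at run time and taking the first that loads. The following is a minimal sketch, not ObjectBroker's loader: the function name is hypothetical, and the candidate file names are examples only and will differ by platform.

```python
import ctypes

# Example shared-library names only; actual names depend on the platform
# and on which GSS-API product is installed.
CANDIDATE_LIBRARIES = ["libgssapi_krb5.so.2", "libgss.so.1"]


def load_first_available(candidates, loader=ctypes.CDLL):
    """Return (name, handle) for the first library that loads, else (None, None)."""
    for name in candidates:
        try:
            return name, loader(name)
        except OSError:
            continue  # not installed here; try the next candidate
    return None, None
```

Because the library is resolved at run time, the product itself ships no cryptographic code, which is the export-control point made above. The `loader` parameter exists so the selection logic can be exercised without any real library installed.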

ObjectBroker Security Features 

The security features that have been successfully imple- 
mented in the ObjectBroker software include 

■ Client-to-server, server-to-client, and mutual 
authentication 

■ Protection from replay and sequencing attacks and 
integrity protection 

■ Fine-grain control over the authentication mecha- 
nism (per-host, per-server, or per-method) 

■ Ability to demand a new security context for an 
invocation 

■ Ability to apply new security features to applica- 
tions without rebuilding them 

■ Dynamically loadable security libraries 

Usage 

One of the most important characteristics of a secure 
ORB is that applications (clients and servers) need not 
be aware of security operations undertaken on their 
behalf. For ORBs, as well as for other support soft- 
ware, the goal is to avoid burdening applications with 
the need to deal with the complexities of a distributed 
system so that they can concentrate on the application 
problem at hand. 

Therefore, most of ObjectBroker's security-relevant 
operations are invisible to applications. ObjectBroker's 
management utilities are used to specify the rules for 
authenticating clients and servers. These rules are 
stored in the ObjectBroker Security Registry, and the 
required authentications are performed automatically. 

There are two exceptions to the general rule of 
keeping security operations invisible to the applica- 
tion. The first is that a client or a server (when acting as 
a client) can explicitly make a call to an ObjectBroker 
API to toggle mutual authentication on or off. This 
operation is allowed as long as it does not diminish the 
security level specified for the ObjectBroker node as a 
whole. In other words, a client can demand mutual 
authentication on a node that does not require such 
authentication but cannot disable mutual authentica- 
tion if the node does require it. This feature was imple- 
mented to make it possible for clients to enable mutual 
authentication for specific operations that have secu- 
rity relevance. 
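
The rule that a client may raise but never lower the node's security level can be captured in a small guard. This is a sketch of the policy only; the class name and the choice of `PermissionError` are assumptions, and the actual ObjectBroker API call is not shown in the text.

```python
class MutualAuthSetting:
    """Per-client mutual authentication switch bounded by node policy."""

    def __init__(self, node_requires):
        self.node_requires = node_requires
        self.enabled = node_requires  # start at the node's level

    def set(self, enabled):
        # A client may turn mutual authentication on at any time, but may
        # not turn it off when the node as a whole requires it.
        if self.node_requires and not enabled:
            raise PermissionError("node policy requires mutual authentication")
        self.enabled = enabled
```

On a node that does not require mutual authentication, a client can call `set(True)` before a sensitive operation and `set(False)` afterward; on a node that requires it, `set(False)` is refused.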
The second exception is that a server can demand 
the creation of a new security context for an invoca- 
tion, which immediately tests the authentication of 
the principal making the request. This is important 
because the GSS-API allows the initiation of a security 
context that has no expiration. Clearly, if a security 
context exists for a long enough period, there may be 
a concern that it is no longer valid. For example, when 
a user's account is revoked from the DCE Security
Registry, it is possible that the user's credentials are still 
valid in some existing security context. Establishing a 
new security context forces the DCE run time software 
to go back to the security server and verify the validity 
of the principal. 
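
The rule governing the first exception can be written as a small policy check. The following is a minimal sketch of that rule only, not ObjectBroker's API: the function names and parameters are hypothetical, and whether a disallowed request is rejected or silently ignored is not stated in the text.

```python
def toggle_allowed(node_requires: bool, requested: bool) -> bool:
    """Return True if a client may set mutual authentication to `requested`.

    node_requires -- node-wide setting (hypothetical name)
    requested     -- value the client asks for through the toggle call
    A client may raise the security level but may not lower it.
    """
    return requested or not node_requires


def effective_mutual_auth(node_requires: bool, requested: bool) -> bool:
    """Mutual authentication is used if either the node or the client demands it."""
    return node_requires or requested
```

With these definitions, a client on a node that does not require mutual authentication can still turn it on for a security-relevant operation, while a request to turn it off on a node that requires it has no effect.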

Figure 1 illustrates the interaction of ObjectBroker
and the DCE Security Service components in the 
establishment of a security context. Once the security 
context is established, it is used in the verification of 
MAC-sealed messages between the server and the 
client. In this illustration, access to the DCE security 
subsystem is depicted as a local call, though accessing 
these services could also be done remotely. 

The sequence of operations in Figure 1 is as follows: 

1. A method invocation (a client request for a remote
operation) results in a call to ObjectBroker's security
subsystem.

2. The ObjectBroker security subsystem in turn
invokes a GSS routine in the DCE Security library.
This call determines whether a new security context
needs to be established, which can happen for
one of two reasons: either it is the first invocation
of this server from this client or the context refresh
rate has been specified as per-invocation.

3. The DCE Security library executes the call, which
sets up the security context. (Note that the process
of deleting an existing security context is not
shown.)

4. The security subsystem checks the return status of
the GSS routine to determine whether the resulting
token is to be passed to the invocation layer.

5. If so, the token is passed to the transport layer for
marshaling.

6. The client communicates with the server node
through the normal ObjectBroker channel.

7. The transport layer in the receiving node unmarshals
the message, examines the transport message
header, and passes control to a dispatcher in the
invocation layer.

8. Depending on the message type, the message may
then be passed to a special dispatcher, in this case
the security dispatcher in the security subsystem.

9. The security subsystem determines that the message
should be handled by the GSS implementation
and passes the message there.

10. The DCE Security layer checks the received token
and, if it is valid, accepts the security context. The
routine generates a context establishment token
to be passed to the client. The call also returns the
server's context handle for the security context the
server shares with the client.

11. The security layer passes the token to the invocation
layer for marshaling.

12. The invocation layer marshals the information and
sends it as an argument to the low-level transport
routine call.

13. This message is sent to the client.

14. The data is unmarshaled.

15. The message is sent to the security subsystem.

16. The token is passed to the GSS implementation
to initialize the security context, with the server-supplied
token as an argument. The routine
returns the client's context handle, which is used
to sign subsequent messages.

Performance Considerations

The benefits of a secure ORB are not free. If authentication
is required when a client and server establish a
connection through a binding, part of that binding
involves the establishment of a security context.
Establishment of a security context requires a round-trip
on the network, during which a token from the
client is passed to the server, and a token is returned
from the server to the client in the mutual authentication
case.

Once established, the security context is used in 
subsequent requests (provided that the configuration 
does not require security context deletion after every 
method invocation). If the same security context is 
reused, the only additional overhead considerations 
are (1) the signing and verification of requests and
responses in the client and server, and (2) the security 
context handle (32 additional bytes of information) 
appended to each message passed between the client 
and the server. 

The signing and verification of a signature on a
request or response is different from the verification
of the privileges used when the security context is first
set up, in that verification of a signature does not
require a network round-trip. In contrast, when you 
first set up a security context, a network round-trip to 
the privilege server is required, and its overhead is 
significantly more costly than that of the verification 
and signature operations. 

Note that when a client has multiple object references
to a single method implementation in a server, a single
security context can still be used. For example, a derived
object reference does not require a new security context.
This is both an optimization and a functional
requirement, since only one security context is allowed
between a client process and a server implementation.
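
The cost model described in this section can be illustrated with a toy context cache. This is a sketch under stated assumptions rather than ObjectBroker code: the class, its field names, and the string used as a context handle are all hypothetical. Only the 32-byte handle size, the one-context-per-pair rule, and the per-invocation refresh option come from the text.

```python
from dataclasses import dataclass, field

CONTEXT_HANDLE_BYTES = 32  # per-message overhead stated in the text


@dataclass
class ContextCache:
    """Toy model: at most one security context per (client, server) pair."""

    per_invocation: bool = False      # hypothetical refresh-rate setting
    contexts: dict = field(default_factory=dict)
    round_trips: int = 0              # context establishments performed

    def invoke(self, client: str, server: str, payload: bytes) -> int:
        """Send one request and return the size of the message on the wire."""
        key = (client, server)
        if self.per_invocation or key not in self.contexts:
            # First invocation, or a refresh is demanded: one network round-trip.
            self.round_trips += 1
            self.contexts[key] = f"ctx-{self.round_trips}"
        # Every signed message carries the security context handle.
        return len(payload) + CONTEXT_HANDLE_BYTES
```

With the default setting, three invocations between the same client and server cost one round-trip plus 32 bytes per message; with `per_invocation=True`, each invocation pays for its own context establishment.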

Future Work 

The OMG specifies a number of object services in addi- 
tion to the CORBA specification itself. One of the most 
important specifications is for the CORBA Security 
Service. ObjectBroker's integration with DCE Security
was designed and implemented before the OMG's
CORBA Security Service specification was complete. 
Thus, even though ObjectBroker is the most secure 
ORB available today, it is reasonable to ask when and 
how its security features will be made compliant with 
the latest specifications from the OMG. 

Given sufficient resources, ObjectBroker engineering
could investigate supporting CORBA2 interoperability
by implementing the OMG's General
Inter-ORB Protocol (GIOP). The GIOP architecture
supports both the Internet Inter-ORB Protocol (IIOP)
and the DCE-based Common Inter-ORB Protocol
(DCE-CIOP). Today, ObjectBroker uses a wire protocol
based on the CORBA version 1.2 specification.

Security for the IIOP is governed by the Secure Inter-ORB
Protocol (SECIOP) specification,10 although few
commercially available implementations of the SECIOP
are available at the time of this writing. Also, as mentioned
previously, security for the DCE-CIOP is accomplished
by protecting the RPC connections at the wire
protocol level. For the DCE RPC, the DCE does its
own authentication for RPC sessions; here the RPC
connection between the client and the server, rather
than the client and the server themselves, is authenticated.
This approach provides the same potential for
security management in the ORB configuration; it
simply accomplishes the security functions at a level in
the protocol stack that does not require the use of the
GSS-API. By building in support for the GIOP,
ObjectBroker gains the capability to provide the security
features for both the IIOP and the DCE-CIOP
protocols in future releases.

The SECIOP and the DCE-CIOP both follow the 
usage model of minimizing the need for applications 
to be aware of security. In the SECIOP, the OMG 
has specified APIs for security functions, and these 
functions are entirely separate from any mechanism 
that implements them. ORB vendors will be free to 
provide security features in much the same way that 
ObjectBroker provides security today, i.e., by working
from security-related information kept by the ORB. 
The SECIOP also provides for administrative objects 
and operations that perform security management 
functions by means of APIs. 

Conclusion 

ObjectBroker provides state-of-the-art distributed 
system security today. Its security features provide 
upward compatibility, as well as the least possible dis- 
turbance to existing ObjectBroker applications and 
configurations. In addition, ObjectBroker's implementation
of security by means of the DCE's Generic
Security Service Application Programming Interface
provides the greatest possible choice among security
mechanisms and security implementation providers.

This paper describes a 160 MHz 500 mW
StrongARM microprocessor designed for low- 
power, low-cost applications. The chip imple- 
ments the ARM V4 instruction set 1 and is bus 
compatible with earlier implementations. 
The pin interface runs at 3.3 V but the internal 
power supplies can vary from 1.5 to 2.2 V, pro- 
viding various options to balance performance 
and power dissipation. At 160 MHz internal clock 
speed with a nominal Vdd of 1.65 V, it delivers 
185 Dhrystone2.1 MIPS while dissipating less 
than 450 mW. The range of operating points 
runs from 100 MHz at 1.65 V dissipating less 
than 300 mW to 200 MHz at 2.0 V for less than 
900 mW. An on-chip PLL provides the internal 
clock based on a 3.68 MHz clock input. The chip 
contains 2.5 million transistors, 90% of which 
are in the two 16 kB caches. It is fabricated 
in a 0.35-μm three-metal CMOS process with
0.35 V thresholds and 0.25-μm effective channel
lengths. The chip measures 7.8 mm x 6.4 mm
and is packaged in a 144-pin plastic thin quad
flat pack (TQFP) package.

Introduction

As personal digital assistants (PDA's) move into the
next generation, there is an obvious need for addi- 
tional processing power to enable new applications 
and improve existing ones. While enhanced function- 
ality such as improved handwriting recognition, voice 
recognition, and speech synthesis are desirable, the 
size and weight limitations of PDA's require that 
microprocessors deliver this performance without 
consuming additional power. The microprocessor
described in this paper (the Digital Equipment
Corporation SA-110, the first microprocessor in the
StrongARM family) directly addresses this need by
delivering 185 Dhrystone 2.1 MIPS while dissipating 
less than 450 mW. This represents a significantly 
higher performance than is currently available at this 
power level. 

CMOS Process Technology 

The chip is fabricated in a 0.35-μm three-metal CMOS
process with 0.35 V thresholds and 0.25-μm effective
channel lengths. Process characteristics are shown
in Table 1. The process is the result of several generations
of development efforts directed toward high-performance
microprocessors. It is identical to the one
used in Digital Equipment Corporation's current
generation of Alpha chips except for the removal of
the fourth layer of metal and the addition of a final
nitride passivation required for plastic packaging.

The factors which drive process development for 
low-power design are similar to those which drive the 
process for pure high-performance although the moti- 
vation sometimes differs. For example, while both 
types of designs benefit from maximizing Idsat of the 
transistors at the lowest acceptable Vdd, the motiva- 
tion for a pure high-performance design is reducing 
power distribution and thermal problems rather than 
extending battery life. Similar arguments apply to
minimizing transistor leakage and on-chip variation of
transistor parameters. This convergence of goals has
been essential to our ability to develop one process
to satisfy the requirements of both low-power and
high-performance families.

Table 1
Process Features

Feature size      0.35 μm
Channel length    0.25 μm
Gate oxide        6.0 nm
Vtn/Vtp           0.35 V/-0.35 V
Power supply      2.0 V (nominal)
Substrate         P-epi with n-well
Salicide          Cobalt-disilicide in diffusions and gates
Metal 1           0.7 μm AlCu, 1.225 μm pitch (contacted)
Metal 2           0.7 μm AlCu, 1.225 μm pitch (contacted)
Metal 3           1.4 μm AlCu, 2.8 μm pitch (contacted)
RAM cell          6 transistor, 25.5 μm²

Power Dissipation Tradeoffs 

RISC microprocessors operating at 160 MHz are fairly
common using current CMOS process technology.
The novel aspect of this design is the ability to achieve
this operating frequency at power levels which are low
enough for handheld applications. Several design
tradeoffs were made to achieve the desired power
dissipation. In order to illustrate their effect on the
design, it is interesting to imagine applying these
tradeoffs to an earlier design whose power dissipation
occupies the opposite end of the power spectrum,
the first reported Alpha microprocessor. This Alpha
chip was fabricated in a 0.75-μm CMOS process and
operated at 200 MHz dissipating 26 W at 3.45 V. The
impact of these tradeoffs is summarized in Table 2.

The first decision is to reduce the internal power 
supply to 1.5 V. This change cuts the power by a factor
of 5.3. While this has the desired effect, it has implica- 
tions for the cycle time which are considered in the 
section Circuit Implementation. 

The next step is to reduce the functionality. As compared
to the early Alpha chip, the most obvious sections
missing in this design are the floating point unit
and the branch history table. Floating point is not
required in the target applications and the low branch
latency of this design eliminates the need for the
branch history table. Less obvious, but very important,
is reduced control complexity. This is a simple
machine and we have worked hard to keep it so. We
estimated that the reduced functionality would cut
power by a factor of three.

Table 2
Power Dissipation Tradeoffs

Start with Alpha 21064: 200 MHz @ 3.45 V.   Power dissipation = 26 W

Vdd reduction:       Power reduction = 5.3x    => 4.9 W
Reduce functions:    Power reduction = 3x      => 1.6 W
Scale process:       Power reduction = 2x      => 0.8 W
Reduce clock load:   Power reduction = 1.3x    => 0.6 W
Reduce clock rate:   Power reduction = 1.25x   => 0.5 W

Process scaling reduces node capacitances and there- 
fore chip power. Note that although the area compo- 
nents of the capacitance will decrease as the square 
of the scale factor, the total capacitance change with 
scaling will be less dramatic primarily due to the effect 
of periphery capacitance. We estimate that scaling 
from 0.75 μm of the early Alpha chip to our current
0.35 μm process results in a power reduction of about
a factor of two, a linear reduction with scale factor. 
Once again, coupled with this positive effect of process 
scaling are a host of other issues. Some of those issues 
are considered in the section Power Down Modes. 

Next, consider the clock power. The clock power of 
the Alpha chips is fairly large and while that clocking 
strategy works well for Alpha machines, it is not appropriate
for a low-power chip. Our clocking strategy and
our latch circuits are described in some detail later. 
One major change from the Alpha design was to reject 
the pair of transparent latches per cycle used on the 
Alpha design. Instead, on this design, we switched to a 
single edge-triggered latch per cycle to reduce clock 
load and latch delay. Our estimate is that the changes 
in the clocking reduced the clock power by a factor of 
two. Since the clock power was about 65% of the total 
power on the first Alpha chip, this results in a reduc- 
tion of about 1.3. 

Finally, the reduction in clock frequency from 
200 MHz to 160 MHz drops the power by 1.25.

Clearly, this analysis is not rigorous, but it suggests
that it is reasonable to build a 160 MHz processor chip
that dissipates around half a watt. A similar analysis was 
performed at the beginning of the project to select the 
power supply voltage and operating frequency and to 
determine whether significant changes in design 
method would be required to meet the performance 
and power goals. It is interesting to note that with the 
exception of the clocking changes, the design methods 
and philosophy used on this design were very similar 
to that used on the Alpha chips. 
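
The running totals in Table 2 can be reproduced by dividing the starting power by each factor in turn. The short sketch below does this with the factors from the table; the function and variable names are illustrative only. The last line checks the supply-voltage factor against the usual assumption that dynamic power scales with the square of Vdd, which is an assumption of this sketch and not a statement from the paper.

```python
# Reduction factors taken from Table 2; the starting point is the 26 W Alpha chip.
STEPS = [
    ("Vdd reduction", 5.3),
    ("Reduce functions", 3.0),
    ("Scale process", 2.0),
    ("Reduce clock load", 1.3),
    ("Reduce clock rate", 1.25),
]


def cumulative_power(start_watts, steps):
    """Return (step name, remaining power in watts) after each reduction."""
    power = start_watts
    trail = []
    for name, factor in steps:
        power /= factor
        trail.append((name, round(power, 1)))
    return trail


# Supply scaling alone, assuming dynamic power varies with Vdd squared.
vdd_factor = (3.45 / 1.5) ** 2

for name, watts in cumulative_power(26.0, STEPS):
    print(f"{name:18s} {watts} W")
```

The squared voltage ratio comes to roughly 5.3, which agrees with the first row of the table.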

Instruction Set 

The microprocessor implements the ARM V4
instruction set. The architecture defines thirty 32-b
general purpose registers and a program counter (PC).
Registers are specified by a 4-b field where registers
0 to 14 are general purpose registers (GPR) and register
15 is the PC. The current processor status register
contains a current mode field which selects either an
unprivileged user mode or one of six privileged modes.
The current mode selects which set of GPR's is visible.

In addition to basic RISC features of fixed length
instructions and simple load/store architecture, the
architecture implemented includes several features to
improve code density. These include conditional execution
of all instructions, load and store multiple instructions,
auto-increment and auto-decrement for loads
and stores, and a shift of one operand in every ALU
operation. The architecture supports loads and stores of
8-, 16-, and 32-b data values. In addition to the standard
32-b computations, there is a 32-b x 32-b multiply
accumulate with a 64-b product and accumulator.

Chip Microarchitecture 

As shown in Figure 1, the chip is functionally partitioned
into the following major sections: the instruction
unit (IBOX), integer execution unit (EBOX), integer
multiplier (MUL), memory management unit for data
(DMMU), memory management unit for instructions
(IMMU), write buffer (WB), bus interface unit (BIU),
phase locked loop (PLL), and caches for data (Dcache)
and instructions (Icache). To minimize pin power and
support the high-speed internal core, one half of the
chip area is devoted to the two 16 kB caches. The pad
ring occupies one-third of the chip area and the processor
core fills the remaining one-sixth of the chip area.

The processor is a single issue design with a classic 
five-stage pipeline — Fetch, Issue, Execute, Buffer, and 
Register File Write (Figure 2). All arithmetic logic unit 
(ALU) results can be forwarded to the ALU input and 
there is a one-cycle bubble for dependent loads. 

Figure 2
Basic Pipeline Diagram

For example, the pipeline diagram in Figure 2
shows a SUBTRACT followed by a dependent LOAD.
Note that at the end of cycle 3, we bypass the result
from the SUBTRACT back into the ALU to compute
the load address in cycle 4 without stalling the pipe.
The third instruction is an ADD which uses the result
of the previous LOAD. The ADD is held in the Issue
stage for one additional cycle until the LOAD data is
available at the end of cycle 5.
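
The forwarding and load-use timing in this example can be modeled in a few lines. The following is a simplified sketch, not a description of the actual control logic: it assumes a single-issue, in-order pipeline in which an ALU result can be forwarded at the end of its Execute cycle and LOAD data one cycle later. The tuple format and register names are made up for illustration.

```python
def execute_cycles(program):
    """Return the Execute-stage cycle of each instruction.

    program is a list of (opcode, dest, sources) tuples. An ALU result
    can be forwarded at the end of its Execute cycle; LOAD data becomes
    available one cycle later, at the end of the Buffer stage.
    """
    ready = {}    # register -> cycle at whose end its value is available
    cycles = []
    prev = 2      # Fetch is cycle 1 and Issue cycle 2, so Execute starts at 3
    for index, (opcode, dest, sources) in enumerate(program):
        earliest = max(index + 3, prev + 1)   # in-order, one issue per cycle
        needed = max([ready.get(s, 0) + 1 for s in sources], default=0)
        cycle = max(earliest, needed)         # wait in Issue if data is late
        ready[dest] = cycle + 1 if opcode == "LOAD" else cycle
        cycles.append(cycle)
        prev = cycle
    return cycles


program = [
    ("SUB", "r1", ["r2", "r3"]),
    ("LOAD", "r4", ["r1"]),        # address depends on the SUBTRACT result
    ("ADD", "r5", ["r4", "r6"]),   # uses the loaded data
]
print(execute_cycles(program))
```

In this model the SUBTRACT executes in cycle 3, the LOAD in cycle 4 with no stall, and the ADD in cycle 6, one cycle later than it would without the load dependency.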

The IBOX can resolve conditional branches in the
Issue stage even when the condition codes are being
updated in the current Execute cycle. By providing
this optimized path, the IBOX incurs only a one-cycle 
penalty for branches taken, so the chip does not 
require branch prediction hardware. For example, in 
the pair of instructions shown in Figure 3, the 
BRANCH and LINK instruction at the (program 
counter) PC of 104 depends on the condition codes 
which are being generated by the SUBTRACT in the
previous instruction. The condition codes from the 
Execute stage of the SUBTRACT are available at the 
end of cycle 3, in time to swing the PC multiplexer in 
the IBOX to point at the branch target PC during the 
next Fetch cycle. 

The optimization of the branch path represents a
power versus performance tradeoff in which performance
won. In our effort to hold the one-cycle branch
penalty, we included a dedicated adder in the IBOX to
calculate the branch target address and consumed
additional power in the EBOX adder to meet the critical
speed path to control the PC multiplexer. Due to
critical path constraints, the adder in the IBOX must
run every cycle, even if the instruction is not a branch.

In the early stage of the design, one of our concerns
was whether the decision to pursue this optimized
branch path would increase our cycle time. As the
design turned out, our best efforts in this ALU path
and in the cache access path resulted in nearly identical
delays for these two longest critical speed paths.

Data for integer operations comes from a 31-entry
register file with three read and two write ports.
Sixteen of the registers are visible at any time with
15 additional shadow registers specified by the architecture
to minimize the overhead associated with initiating
exceptions. The EBOX contains an ALU with a
full 32-b bidirectional shifter on one of the input
operands. It includes bypassing circuitry to forward
the data from the Dcache or the ALU output to any
of the read ports. Figure 4 shows the circuit blocks
invoked in the branch path. The path starts at a latch
in the bypassers and, in a single cycle, includes a
0- to 32-b shift, a 32-b ALU operation, and a condition
code computation to swing the PC multiplexer
for the next cycle. The registers to hold the condition
codes were implemented in the EBOX so that this
path could be locally optimized. Analysis of code
traces indicated that most ALU operations included a
shift of zero, so for this case, the shifter is disabled and
bypassed to reduce power.

The EBOX also contains a 32-b multiply/accumulate
unit. The multiplier consists of a 12- by 32-b
carry-save multiplier array which is used for one to
three cycles depending on the value of the multiplicand
and a 32-b final adder to reduce the carry-save result.
For multiply accumulate operations, the accumulate
value is inserted into the array so that an additional
cycle is not required for the Multiplies with
Accumulate. Multiply Long instructions require
one additional cycle. This results in a MULTIPLY or
MULTIPLY/ACCUMULATE in two to four cycles
and MUL LONG or MUL LONG/ACCUMULATE
in three to five cycles.
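
The cycle counts quoted above follow from retiring 12 b of the operand per array cycle and then spending one cycle in the final adder. The sketch below is one way to model this. It rests on assumptions that go beyond the text: the operand is treated as unsigned, and the array is assumed to stop as soon as the remaining high-order bits are zero. The function name is hypothetical.

```python
def multiply_cycles(operand: int, long: bool = False) -> int:
    """Estimate cycles for a multiply under a simple early-termination model.

    operand -- the 32-b value that controls early termination (assumed
               unsigned here; the real rule may also cover sign extension)
    long    -- True for the Multiply Long forms, which need one more cycle
    """
    operand &= 0xFFFFFFFF
    significant_bits = max(operand.bit_length(), 1)
    array_cycles = -(-significant_bits // 12)   # ceiling division, 1 to 3
    total = array_cycles + 1                    # plus the final adder cycle
    return total + 1 if long else total
```

Under this model the accumulate forms take the same time as the plain multiplies, because the accumulate value is inserted into the array rather than added in a separate cycle.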

The Wallace tree implementation was chosen to 
minimize the delay through the array. This implementation
required careful floor planning and custom layout
to keep the wiring under control. The decision to
perform 12 b of multiply per cycle was based on wiring 
tradeoffs made during the physical planning phase of 
the design rather than critical path concerns. When the 
multiplier is not in use, all clocks to the section stop 
and the input operands do not toggle. 

The chip features separate 16-kB, 32-way set
associative virtual caches for instructions and data.
Each cache is implemented as 16 fully associative
blocks. Each cache is accessed in a single cycle for both
reads and writes, providing a two-cycle latency for
return data to the register file. One eighth of each
cache is enabled for a cache access.

The Dcache is writeback with no write allocation.
The block size is 32 bytes with dirty bits provided for
each half block to minimize the data which needs to be
castout in the event of a dirty victim. The physical
address is stored with the data to avoid address translation
during castouts.

Given the size of the caches and the low power
target for the chip, it was important that we have fine
granularity of bank selection. In addition, we required
associativity of at least four-way for cache efficiency
and it was important to performance that we maintain
a single cycle access. We considered several solutions
to this problem, including traditional four-way set
associative caches, and decided that the simplest
approach which satisfied all the requirements was to
implement the caches as smaller, bank-addressed, fully
associative caches. This resulted in 32-way associativity
but this level of associativity was a side effect of the
implementation used, not the result of a goal to get
associativity significantly above four-way.
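
The 32-way figure follows directly from the cache dimensions given above, as the short calculation below shows. The `split_address` helper is a hypothetical illustration only: it assumes the bank is selected by the address bits immediately above the block offset, which the text does not specify.

```python
CACHE_BYTES = 16 * 1024   # each cache is 16 kB
BLOCK_BYTES = 32          # block size stated for the Dcache
BANKS = 16                # fully associative blocks per cache

blocks = CACHE_BYTES // BLOCK_BYTES          # 512 blocks in total
ways = blocks // BANKS                       # 32 entries per bank
offset_bits = BLOCK_BYTES.bit_length() - 1   # 5 bits of byte offset
bank_bits = BANKS.bit_length() - 1           # 4 bits of bank select


def split_address(addr: int):
    """Split a 32-bit virtual address into (tag, bank, offset).

    Assumes the bank index sits just above the block offset; this bit
    assignment is an assumption made for illustration.
    """
    offset = addr & (BLOCK_BYTES - 1)
    bank = (addr >> offset_bits) & (BANKS - 1)
    tag = addr >> (offset_bits + bank_bits)
    return tag, bank, offset
```

Because each bank is fully associative, a block may occupy any of its 32 entries, so the associativity equals the number of entries per bank.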

The chip includes separate memory management
units (MMU) for instructions and data. Each MMU
contains a 32-entry fully associative translation lookaside
buffer (TLB) with entries which can map either
4 kB, 64 kB, or 1 MB pages. TLB fills are implemented
in hardware. In addition to the standard memory
management protection mechanisms, the ARM architecture
defines an orthogonal memory protection
scheme to allow the operating system easy access to
large sections of memory without manipulating the
page tables. This functionality requires a set of additional
checks which must be performed after the TLB
lookup. The resulting critical path was sufficiently
long that we self-timed the RAM access in the TLB to
allow us to perform the lookup and complex protection
checks in a single cycle.

A write buffer with eight 16-byte entries handles 
stores and castouts from the Dcache. The write buffer
includes a single-entry merge latch to pack up sequen- 
tial stores to the same entry. 

During normal operations, an external load request 
takes priority over stores on the pin bus. However, in
the event of a load which hits in the write buffer, the
chip executes a series of priority stores which raises the
priority of the write buffer on the external bus above
that of any loads. External stores occur and the write 
buffer empties until the store which was pending at 
the load address completes. At this point, top priority 
reverts back to loads. 

Power Down Modes 

There are two power down modes supported by the
chip — Idle and Sleep. 

Idle mode is intended for short periods of inactivity 
and is appropriate for situations in which rapid 
resumption of processing is required. In Idle mode, 
the on-chip PLL continues to run but the internal
clock grid and the bus clock stop toggling. This eliminates
most activity in the chip and the power dissipation
drops from 450 mW to 20 mW. Return from Idle
to normal mode is accomplished with essentially no
delay by simply restarting the bus clock.

Sleep mode is designed for extended periods of inactivity
which require the lowest power consumption.
The current in Sleep mode is 50 μA which is achieved
by turning off the internal power to the chip. The 3.3 V
I/O circuitry remains powered and the chip is well
behaved on the bus, maintaining specified levels if
required by the drive enable inputs. Return from Sleep
to normal operation takes approximately 140 μs.

As was noted earlier, a low voltage process is key 
to the design of a microprocessor which will run at 
160 MHz while dissipating less than 450 mW. 
However, the same low device thresholds which allow
the reduction of Vdd also result in significant device 
leakage. While this leakage is not large enough to 
cause a problem for normal operation, it does pose 
problems for standby current, especially if the pro- 
cess skews toward short channel devices. Our initial 
analysis indicated that the chip would dissipate over 
100 mW in Idle mode with the clocks stopped. To 
reduce this leakage, we lengthened devices in the 
cache arrays, the pad drivers, and certain other areas. 
This brought the leakage power to within the required
value of 20 mW in the fastest process corner. As a
backup, we relaxed our design rules to allow the
remaining gate regions, which are drawn with a standard
0.35 μm gate length, to be biased up algorithmically
without violating design rules in case it was
necessary to meet the leakage requirements.

The requirement for standby power in Sleep is more
than two orders of magnitude lower than the Idle
power. To meet the power limit in Sleep, we considered
a variety of options including integrated power
supply switches and substrate biasing schemes before
choosing the simple approach of turning off the internal
supply. This approach is reasonable for this generation
of parts since they have a dedicated low voltage
supply. As more parts of the system shift to the low 
voltage supply, this may no longer be acceptable. The 
conflicting requirements of high performance at low 
voltage and low standby current promise to create 
interesting challenges in future designs. 
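
A rough calculation shows the size of the gap between the two standby modes. It takes the Sleep current as 50 μA and assumes that this current is drawn entirely from the 3.3 V I/O supply, since the internal supply is switched off; that supply assignment is an assumption of this sketch.

```python
IDLE_POWER_W = 20e-3      # Idle-mode power from the text
SLEEP_CURRENT_A = 50e-6   # Sleep-mode current
IO_SUPPLY_V = 3.3         # assumed to be the only supply left on in Sleep

sleep_power_w = SLEEP_CURRENT_A * IO_SUPPLY_V   # about 165 microwatts
ratio = IDLE_POWER_W / sleep_power_w            # about 121

print(f"Sleep power: {sleep_power_w * 1e6:.0f} uW, Idle/Sleep ratio: {ratio:.0f}")
```

Under these assumptions the ratio is a little over 100, consistent with a Sleep requirement more than two orders of magnitude below the Idle power.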

The power switch to turn off the internal power 
supply during Sleep is implemented off-chip as part 
of the power supply circuit for the low voltage supply. 
No state is stored internally during Sleep since in 
typical PDA systems, the Sleep state corresponds to
the user turning the system off. Therefore the time
associated with reloading the cache upon return from
Sleep is acceptable.

The requirements in Idle and Sleep complicated the
design of the bus interface circuits. This section
includes the level-shifting interface between the internal
low voltage (1.5 to 2.2 V) signals and the 3.3 V
external pin bus. The bus interface circuits must drive
and receive signals which are higher voltage than those
nominally supported by the 0.35-μm process without
using circuits which would cause us to exceed the current
limit specified by the Idle spec. In addition, during
Sleep the pads must be able to sustain the value
on the output pins despite the loss of internal Vdd
(Vddi) from the low voltage supply which is powered
off by the system. The circuitry used to implement this
function is shown in Figure 5.

Since Vddi will be driven to zero by the system
during Sleep, it is used not only as a power supply
but also as a logic signal. All circuitry which must
be active in Sleep is driven from the external 3.3 V
supply (Vddx) which has been dropped through diode-connected
PMOS devices to reduce the stress on the
oxide of these devices. Before signaling the chip to
enter Sleep, the system asserts the nRESET pin (active
low) which drives all enabled outputs to a specified
state: disabled for control signals and zero for
addresses and data. It then asserts nPWRSLP (active
low) which is ANDed with the appropriate output
enable control to turn on small leaker devices which
will hold the output pin in the appropriate state during
Sleep. In the circuit shown in Figure 5, the output is
an address. Therefore, the address bus enable (ABE)
pin is the control pin on the lower NMOS leaker and a
buffered version of nPWRSLP controls the top device.
Finally, the Vddi pins are actively driven to zero by the
system. This action disables the output stage of the
pad driver circuit by turning off the transistors closest
to the pad: the NMOS directly and the PMOS via the
bias network whose output goes to Vddx when its
path to Vss is cut off. Note that for any input whose
value is required during Sleep (ABE and nPWRSLP in
the example described), a separate parallel input
receiver must be implemented since the normal input
receiver requires Vddi.

Circuit Implementation 

The circuit implementation is pseudostatic and allows
the internal clock to be stopped indefinitely in either
state. Use of circuits which might limit low voltage
operation was strictly controlled and the design was
simulated to ensure operation significantly below
the nominal 1.5 V level of the low voltage supply. The
values of the internal supply and operating frequency 
were optimized to achieve maximum performance for 
less than half a watt. 

The vast majority of the design is purely static, 
composed of either complementary CMOS gates or 
static differential logic. In certain situations, wide 
NOR functions were required and these were implemented
in a pseudostatic fashion using either static
weak feedback circuits or self-timed circuits to latch 
the output data and return the dynamic node to its 
precharged state. 

The register file (RF) uses the self-timed approach
to return the bit lines to the precharged state after an
access (Figure 6). In this circuit, an extra self-timing
column of bit cells with a dynamic bit line was implemented
to mimic the timing of the data bit lines.
Figure 6 shows one cell from a column of register file
data bit cells and one cell from the extra self-timing
column (only one read port is shown). The bit cells
in this extra column are all tied off so that the
SELF_BITLINE signal will always discharge when
the READ_WORDLINE goes high. When the
SELF_BITLINE falls, it will set an RS latch causing the
SELF_ENABLE signal to fall. This will disable the
READ_WORDLINE and cause the bit lines to be
precharged high when the read access is complete.
Since the DATA_BITLINEs are received by low-sensitive
RS latches, the output data will be held when the
bit line is precharged high. The self-timing RS latch is
cleared when CLOCK_L goes low. This causes the
SELF_ENABLE signal to go high, enabling the read
port for the access in the next clock cycle. A separate
SELF_BITLINE signal is implemented for each of the
three register file ports so that the clocks for the three
ports can be enabled independently.
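
The precharge sequence described above can be traced with a
small behavioral model. This is only an illustrative sketch in
Python, not the circuit itself: the signal names follow the
text, while the function name, the dictionary representation,
and the assumed initial value of CLOCK_L are hypothetical.

```python
def simulate_read_cycle():
    """Trace one read access through the self-timed precharge loop."""
    sig = {
        "CLOCK_L": 1,        # assumed high while the access is in progress
        "SELF_ENABLE": 1,    # read port enabled at the start of the cycle
        "READ_WORDLINE": 0,
        "SELF_BITLINE": 1,   # dynamic bit line starts precharged high
        "RS_LATCH": 0,       # self-timing RS latch starts cleared
    }
    trace = []

    def log(step):
        trace.append((step, dict(sig)))

    # 1. The word line goes high only if the port is enabled.
    sig["READ_WORDLINE"] = sig["SELF_ENABLE"]
    log("READ_WORDLINE rises")

    # 2. The tied-off cells always discharge the self-timing bit line.
    if sig["READ_WORDLINE"]:
        sig["SELF_BITLINE"] = 0
    log("SELF_BITLINE falls")

    # 3. The falling bit line sets the RS latch, so SELF_ENABLE falls.
    if sig["SELF_BITLINE"] == 0:
        sig["RS_LATCH"] = 1
        sig["SELF_ENABLE"] = 0
    log("SELF_ENABLE falls")

    # 4. With SELF_ENABLE low the word line is disabled and the
    #    bit lines are precharged high again.
    sig["READ_WORDLINE"] = 0
    sig["SELF_BITLINE"] = 1
    log("bit lines precharged")

    # 5. CLOCK_L going low clears the latch and re-enables the port.
    sig["CLOCK_L"] = 0
    sig["RS_LATCH"] = 0
    sig["SELF_ENABLE"] = 1
    log("latch cleared for next cycle")
    return trace


for step, state in simulate_read_cycle():
    print(f"{step:30s} {state}")
```

The model has no notion of delay; it only shows the order of
events that lets the extra column time the precharge.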

The transistor leakage associated with the low-threshold
voltages is problematic for these pseudostatic
circuits. If a weak feedback circuit is used in a
NOR structure which is precharged high, excessive
leakage in the parallel NMOS pulldowns would
require that the feedback be fairly strong, which in turn
would reduce the speed of the circuit. In the limit of
very wide NORs, it may not be possible to size a
PMOS leaker so that it can supply the leakage of all the
off NMOS pulldowns without making the leaker too
large to be overpowered by a single active pulldown.
In the case of a self-timed approach, a similar problem
exists but it usually is manifested as a vanishingly small
timing margin for the self-timed circuit to fire before
the data on the dynamic node decays away. In either
case, we addressed this issue by requiring the length of
pulldowns on dynamic nodes to be slightly larger than
minimum. Transistor leakage current is a strong function
of channel length so a 12% increase in device
length results in a leakage reduction in the worst case
of about a factor of 20. The resulting leakage makes
implementation of either weak feedback or a self-timed
approach very reasonable.
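
The leaker sizing constraint above can be illustrated with a
rough calculation. The sketch below assumes a simple model in
which the leaker must source the leakage of all N off pulldowns
while remaining weaker, by some margin, than one active
pulldown. The current values and the margin are hypothetical
placeholders, not figures from the design; only the factor of
20 comes from the text.

```python
def max_nor_width(i_on_na, i_off_na, margin):
    """Largest NOR fan-in for which a PMOS leaker can be sized.

    The leaker must source the leakage of all N off pulldowns
    (N * i_off) yet stay weak enough that one active pulldown
    overpowers it (leaker current <= i_on / margin).
    Currents are integers in nanoamperes to keep the arithmetic exact.
    """
    return (i_on_na // margin) // i_off_na


I_ON_NA = 200_000   # hypothetical on-current of one pulldown
I_OFF_NA = 100      # hypothetical worst-case leakage at minimum length
MARGIN = 4          # hypothetical strength ratio, pulldown to leaker

before = max_nor_width(I_ON_NA, I_OFF_NA, MARGIN)
after = max_nor_width(I_ON_NA, I_OFF_NA // 20, MARGIN)  # 20x lower leakage
print(before, after)  # 500 10000
```

Under these assumptions the feasible NOR width grows by the same
factor of 20. The small loss of drive current in the longer
pulldown is ignored here.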

Figure 6

Self-timed RF Precharge (signal labels recoverable from the
figure: WRITE_WORDLINE, ADDRESS_DECODE, CLOCK_L)

The operating frequency at 1.5 V can be roughly
derived by starting with the frequency of the Alpha
processor in the same process technology and scaling
for the use of a longer tick model and then Vdd. Since
the long tick design requires the chip to perform a full
SHIFT and a full ADD in a single cycle, this approximately
doubles the cycle time required. The effect of
Vdd scaling is roughly linear for this range of Vdd.
Combining these effects results in an operating
frequency at 1.5 V given by

433 MHz × 0.5 × (1.5 V/2.0 V) = 162 MHz.

This pair of voltage and frequency values agrees well
with the power estimate outlined in the section Power
Dissipation Tradeoffs. Note that for power supply
voltages much lower than 1.5 V, the operating frequency
decreases with voltage in a manner which is
significantly stronger than linear. This fact sets a practical
lower limit on the power supply voltage in most
applications.
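
The first-order estimate above is simple enough to check
numerically. The following sketch just restates the scaling
rule from the text; the function and parameter names are
illustrative, and the linear Vdd term is only meant to apply
in the range the text describes, not at much lower voltages.

```python
def scaled_frequency_mhz(ref_mhz, tick_factor, vdd, vdd_ref):
    """Scale a reference frequency by the tick model and linearly by Vdd."""
    return ref_mhz * tick_factor * (vdd / vdd_ref)


# 433 MHz Alpha reference, long tick halves the frequency,
# internal supply lowered from 2.0 V to 1.5 V.
estimate = scaled_frequency_mhz(433.0, 0.5, 1.5, 2.0)
print(f"{estimate:.1f} MHz")  # 162.4 MHz
```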

Power estimates made early in the design are prone to
errors in either direction. In the case of this design, the
power dissipated at 1.5 V was lower than the 450 mW
target, so we shifted the nominal internal Vdd to 1.65 V
to increase the yield in the 160 MHz bin.

Clock Generation 

An on-chip PLL generates the internal clock at one of
16 frequencies ranging from 88 to 287 MHz based on
a fixed 3.68 MHz input clock. Due to internal
resource constraints and our limited experience with
low-power analog circuits, we contracted with Centre
Suisse d'Electronique et de Microtechnique (CSEM)
from Neuchatel, Switzerland, to design the PLL, and
engaged Professor T. Lee from Stanford as a consultant
on the project. Our initial feasibility work resulted
in several design tradeoffs:

First, while there was a system requirement that the
chip return quickly from the Idle state to normal operation,
there was no such constraint on returning from
the Sleep state. Based on this determination and our
20 mW power budget in Idle, we concluded that if we
could keep the PLL power below 2 mW, we could
leave the PLL running in Idle and remove the requirements
on the PLL lock time. Thus, the need for a very
low power PLL is dictated by the power budget in
Idle, not in normal operation.

Next, we had specified a large frequency multiplication
factor to allow the use of a common and cheap low
frequency crystal clock source for consumer products.
Early investigations indicated that this would make
tight phase locking difficult. However, when we
looked at target systems, we found no pressing need for
phase locking. Consequently, we removed phase locking
as a design criterion and concentrated our efforts and
design tradeoffs on minimizing phase compression.



Finally, while the PLL was designed to handle the
noise expected on the chip power supplies, we discovered
toward the end of the design that the PLL was
under its area budget and there was additional space
available in the vicinity. We took advantage of this
opportunity to provide cleaner power to the PLL by
RC filtering our internal supply and we dedicated 1 nF
of on-chip decoupling cap to this purpose.
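
Two of the numbers behind these tradeoffs can be made concrete
with a little arithmetic. The sketch below takes the endpoints
of the frequency range as 88 and 287 MHz and divides by the
3.68 MHz input clock to show how large the multiplication
factor is; it also expresses the 2 mW PLL target as a share of
the 20 mW Idle budget. The variable names are illustrative, and
the actual divider settings of the PLL are not given in the
text.

```python
REF_CLOCK_MHZ = 3.68
F_MIN_MHZ, F_MAX_MHZ = 88.0, 287.0   # endpoints of the 16-step range
IDLE_BUDGET_MW = 20.0
PLL_TARGET_MW = 2.0

mult_min = F_MIN_MHZ / REF_CLOCK_MHZ   # roughly 24x
mult_max = F_MAX_MHZ / REF_CLOCK_MHZ   # roughly 78x
pll_share_of_idle = PLL_TARGET_MW / IDLE_BUDGET_MW

print(f"multiplication factor: {mult_min:.1f}x to {mult_max:.1f}x")
print(f"PLL share of Idle budget: {pll_share_of_idle:.0%}")
```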

CSEM performed the circuit and layout design
and we placed the completed block into the microprocessor.
Since we anticipated that the characterization
of the PLL integrated in the chip would present
some difficulties, we reserved one of the six die sites
on our first pass reticle set for a test chip which contained
several variants of the full PLL and interesting
sub-blocks. These circuits allowed access to a variety of
nodes in the PLL without compromising the design of
the PLL instantiated in the chip. The results of the
PLL characterization are reported in Reference 4.

Clock Distribution 

The chip operates from two clocks as shown in Figure 7.
An internal clock, called DCLK, is usually generated
by the PLL. The second clock is a bus clock, known as
MCLK, which operates up to 66 MHz. MCLK can be
supplied by an external asynchronous source or by the
chip based on a division of the PLL clock signal.

There are five clock regimes in the chip. The first
two regimes are sourced by MCLK and consist of the
pad ring, which receives MCLK directly, and the bus
interface unit (BIU) and part of the write buffer, which
receive MCLK through conditional clock buffers. The
last three regimes are sourced by the internal DCLK
clock tree and contain the Dcache, the Icache, and the
core. In this case, the core includes the IBOX, EBOX,
MUL, IMMU, DMMU, and part of the write buffer.

Both MCLK and DCLK are distributed by buffered
H-trees to conditional clock buffers in the various sections
of the chip. The buffers in the H-tree allow the
use of smaller lines for distribution and result in lower
clock power. Although the three internal clock
regimes are all sourced by the same H-tree, the topology
of the chip did not allow corresponding sections
of the H-tree to be routed in the same metal. This
resulted in an increase in the expected skew between
the caches and the core. In addition, we discovered
that we could squeeze a bit more performance from
the chip if we intentionally offset the clock in the
caches relative to the clock in the core. Consequently,
we used the clock buffers in the H-tree to tune the
clock so that the Dcache receives a clock which is one
gate delay earlier than the core and the Icache receives
a clock which is one gate delay later than the core.

Figure 8 shows the physical routing of the internal 
clock tree. The buffer stages are not shown but they
exist in the center of the chip and in four symmetric 
locations — two in the center of the I and D caches and 
two in locations at the cache/core interface. The final 
leg of the H-tree is tied to conditional clock buffers in 
the caches and the core. The problems associated with 
clock skew within the caches are reduced by the fact 
that only a single bank in each cache is enabled. This 
limits the physical distance over which tightly con- 
trolled clocks need to be delivered in the cache regions. 

The clocks in the core present a more interesting
problem. The final leg of the clock tree in the core
stretches the full height of the chip and tight control of
skew along this node is required for speed and functionality.
It is implemented as a vertical, metal 2 line
driven from four nominally equidistant points. The
clock buffers are standard cells of varying drive
strength built directly under this M2 line to minimize
local variation in delay.

Figure 9

Clock Arrival Time in the Core

Circuit simulations of the H-tree were performed
using SPICE to determine the skew between clock
regions and within each of the clock regions. The
nodes in the grid were extracted from layout and contained
more than 30,000 R and C elements. Figure 9
shows the relative clock arrival time versus the Y coordinate
for each conditional clock buffer on the vertical
leg of the clock tree in the core. The four arrows on
the graph indicate the points from which the final leg
is driven. The data points are the relative arrival times
of the clock input to the conditional clock buffers
sourced by the clock tree. The total simulated skew is
41 ps assuming maximum metal resistance.
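
To put the simulated skew in perspective, it can be compared
with the clock period. The sketch below assumes the 160 MHz
operating point mentioned elsewhere in the article; the text
does not state the frequency at which the skew budget was
evaluated, so the percentage is only indicative.

```python
def skew_fraction(skew_ps, freq_mhz):
    """Return clock skew as a fraction of one clock period."""
    cycle_ps = 1e6 / freq_mhz   # a 1 MHz clock has a 1e6 ps period
    return skew_ps / cycle_ps


frac = skew_fraction(41.0, 160.0)   # 41 ps against a 6250 ps cycle
print(f"{frac:.2%}")
```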

Clock Switching 

One additional complication related to the internal
clock tree is that it is not always driven by the clock
from the PLL, known as CCLK. During cache fills, the
clock source for the internal sections of the chip
switches over to MCLK so that the whole chip is running
synchronous to the bus (Figure 10). This simplifies
fills and it reduces power since the bus clock is
significantly slower than CCLK. Note that since this
machine has a blocking cache, not much happens
while waiting for a cache fill. Therefore, running on
the slower bus clock during fills has essentially no
performance impact.

Since MCLK and CCLK might be asynchronous,
switching the driver of DCLK quickly between the two
clock sources is difficult. Careful attention must be
paid to the synchronization of the mux control signals
to prevent glitch pulses on the clock during the transition
between the clock sources.

Clock switching is only used during fills. Stores
which miss in the cache and castouts are written to
memory through the write buffer without switching
the internal clock over to MCLK. The write buffer
receives both DCLK and MCLK and passes the data
for external stores across the DCLK/MCLK interface
with proper attention to synchronization issues
between the two clock regimes. One interesting characteristic
of clock switching is that it gives the system
designer another option to save power in situations for
which the full performance of the chip is not required.
By disabling clock switching on the fly, you can configure
the chip to run off the bus clock. There is no limit
on asymmetry or maximum pulse width of the bus
clock, so the chip can be operated at very low frequencies
if desired.

Conditional Clock Buffers 

Conditional clock buffers are simple NAND/invert
structures with an integral latch on the condition
input. The buffers must be matched to their load
to minimize skew. Since adding dummy clock loads
is contrary to the low-power design philosophy, we
created scaled clock buffers which would produce
matched clocks for a wide range of loads and only
needed to add dummy clock loads for a small number
of very lightly loaded clock nodes. The task of matching
the clock buffers to the load was greatly simplified
by the fact that the clock load presented by our standard
latches is largely data-independent.

While the use of conditional clock buffers is central
to the design method used on the chip, it should be
noted that the critical paths to generate the condition
input to these buffers represent some of the most difficult
design problems in the chip. In this case, we
decided that the power saving associated with the conditional
clocking was worth the additional design
effort and possible performance reduction.
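
The NAND/invert structure with a latch on the condition input
can be modeled behaviorally to show why the latch matters. The
sketch below is a generic clock-gating model, not a description
of the actual cell: it assumes the latch is transparent while
the clock is low and holds its value while the clock is high,
a polarity the text does not specify. The class and function
names are hypothetical.

```python
class ConditionalClockBuffer:
    """Latch on the condition input, then a NAND followed by an inverter."""

    def __init__(self):
        self.latched = 0

    def evaluate(self, clk, condition):
        # Assumption: the latch is transparent while clk is low
        # and opaque while clk is high.
        if clk == 0:
            self.latched = condition
        nand_out = 0 if (clk and self.latched) else 1
        return 1 - nand_out   # inverter restores the clock polarity


def run(clk_wave, cond_wave):
    buf = ConditionalClockBuffer()
    return [buf.evaluate(c, e) for c, e in zip(clk_wave, cond_wave)]


clk = [0, 1, 1, 0, 1, 1, 0, 1]
cond = [1, 1, 0, 0, 0, 1, 0, 0]
gated = run(clk, cond)
print(gated)  # [0, 1, 1, 0, 0, 0, 0, 0]
```

In this trace the condition changes twice while the clock is
high, and neither change truncates or creates a clock pulse.
The condition must therefore be valid before the clock edge,
which is consistent with the critical-path difficulty noted
above.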

Latch Circuits 

The standard latches used in the design are differential
edge-triggered latches (Figure 11). The circuit structure
is a precharged differential sense amp followed by
a pair of cross-coupled NAND gates. The sense amp
need not be particularly well balanced because the
inputs to the latch are full CMOS levels. The NMOS
shorting device between nodes L3 and L4 provides a
dc path to ground for leakage currents on nodes L1
and L2 in case the inputs to the latch switch after the
latch evaluates. At normal operating frequencies, this
device is not particularly important but it is required
for the latch to be static. Note that since the dc current
flowing is due only to device leakage, the magnitude
of the current is insignificant to the power in normal
operation.

Testability 

The chip supports IEEE 1149.1 boundary scan for
continuity testing. In addition, it has two hardware
features to aid in manufacturing testing. The first is a
bypass to allow CCLK to be driven from a pin synchronous
to MCLK. This allows the tester to control the
timing between CCLK and MCLK to make the asynchronous
sections appear to be deterministic. The second
test feature provides a linear feedback shift register
(LFSR) that can be loaded with instruction data from
the Icache. Loading the LFSR can be conditioned
based on the value of address bit 2 and the Icache hit
signal. The LFSR is loaded after the Fetch stage to
allow the instruction following a branch to be read
from the Icache and loaded into the LFSR. This feature
allows any random pattern to be loaded into the
Icache and then read out by alternating branch
instructions with data pattern words.
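
The idea of compressing Icache contents into an LFSR signature
can be sketched in software. The example below is a generic
multiple-input signature register, offered only as an
illustration: the 32-bit width, the feedback polynomial
(borrowed from CRC-32), and the way the address bit 2 and hit
conditions are applied are all assumptions, since the text does
not describe the LFSR at that level of detail.

```python
WIDTH = 32
MASK = (1 << WIDTH) - 1
POLY = 0x04C11DB7   # assumed feedback taps (the CRC-32 polynomial)


def misr_step(state, word):
    """Shift the register once with feedback, then fold in one word."""
    msb = (state >> (WIDTH - 1)) & 1
    state = (state << 1) & MASK
    if msb:
        state ^= POLY
    return state ^ (word & MASK)


def signature(fetches, bit2=None, require_hit=True):
    """Compress (address, word, hit) tuples into one signature.

    bit2: if 0 or 1, only words whose address bit 2 matches are loaded.
    require_hit: if True, words fetched on an Icache miss are skipped.
    Both conditions are illustrative guesses at the gating in the text.
    """
    state = 0
    for address, word, hit in fetches:
        if require_hit and not hit:
            continue
        if bit2 is not None and ((address >> 2) & 1) != bit2:
            continue
        state = misr_step(state, word)
    return state


fetches = [(0x0, 0xE1A00000, True), (0x4, 0x00000000, True)]
print(hex(signature(fetches)))           # 0xc7811db7
print(hex(signature(fetches, bit2=0)))   # 0xe1a00000
```

A tester would compare the final signature with the value
expected for the pattern written into the cache, so a single
read-out checks many words at once.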

Power Dissipation Results 

Measured Results 

Power dissipation data was collected on an evaluation 
board running Dhrystone 2.1 with the bus clock 
running at one-third of the PLL clock frequency. 
Dhrystone fits entirely in the internal caches so, after 
the first pass through the loop, pin activity is limited. 
This is the highest power case because cache misses 
cause the internal clocks to run at the bus speed and 
result in a lower total power. For both sets of measurements,
external Vdd is fixed at 3.3 V. For an internal
Vdd of 1.5 V, the total power is 2.1 mW/MHz. If
the internal supply is set to 2.0 V, the total power is
3.3 mW/MHz. Note that the ratio of the power at
1.5 and 2.0 V does not track Vdd² because it contains
a component of external power and the external Vdd
is fixed.
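
The measured figures can be turned into absolute power and
compared with ideal Vdd² scaling. The sketch below assumes the
mW/MHz values hold at 160 MHz, which is an extrapolation rather
than a reported measurement; the names are illustrative.

```python
MW_PER_MHZ = {1.5: 2.1, 2.0: 3.3}   # internal Vdd -> measured mW/MHz


def total_power_mw(vdd_internal, freq_mhz):
    """Total power, assuming it scales linearly with frequency."""
    return MW_PER_MHZ[vdd_internal] * freq_mhz


p160 = total_power_mw(1.5, 160)                      # about 336 mW
measured_ratio = MW_PER_MHZ[2.0] / MW_PER_MHZ[1.5]   # about 1.57
ideal_ratio = (2.0 / 1.5) ** 2                       # about 1.78

print(f"power at 160 MHz, 1.5 V: {p160:.0f} mW")
print(f"measured ratio {measured_ratio:.2f} vs Vdd^2 ratio {ideal_ratio:.2f}")
```

The measured ratio falls short of the Vdd² ratio, as expected
when part of the power is drawn from the fixed 3.3 V external
supply.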

Simulated Power Dissipation by Section 

An analysis of node transitions based on simulation 
was performed to estimate the power dissipation asso- 
ciated with the various major sections of the chip 
(Table 3). Toggle information was collected based on 
160,000 cycles of Dhrystone and combined with
extracted node capacitances to estimate power dissipation
by node, and this data was further grouped by section.
The clock power listed in Table 3 is due only to
the global clock circuits.

A few points are worth noting. 

■ First, the power is dominated by the caches as 
you might expect given their size. This is despite 
our efforts to reduce their power through bank 
selection and other means. The Icache burns
more power than the Dcache because it runs
every cycle.



Table 3
Simulated Power Dissipation by Section

Section               Power
ICACHE                27%
IBOX                  18%
DCACHE                16%
CLOCK                 10%
IMMU                   9%
EBOX                   8%
DMMU                   8%
Write buffer           2%
Bus interface unit     2%
PLL                   <1%



■ Next, the PLL power is insignificant in normal oper- 
ation. As was noted earlier, its low power character- 
istics are only important in Idle. 

■ Finally, since reduction in clock power was one of 
our explicit goals, it is interesting to consider the 
total clock power. If you extract the local clock 
power from the nonclock sections and sum it, you 
get a total clock power, including the global clock 
trees, the local clock buffers and the local clock 
loads. This power is 25% of the total chip power,
significantly less than the 65% consumed by the
clocks in the Alpha microprocessor used in our initial
feasibility studies.
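
The figures in Table 3 and in the points above can be
cross-checked with a few lines of arithmetic. The sketch below
simply tabulates the published percentages; treating the
difference between the 25% total clock power and the 10% global
clock entry as local clock power is an inference from the text,
not a number it reports directly.

```python
TABLE_3_PERCENT = {
    "ICACHE": 27, "IBOX": 18, "DCACHE": 16, "CLOCK": 10, "IMMU": 9,
    "EBOX": 8, "DMMU": 8, "Write buffer": 2, "Bus interface unit": 2,
}  # the PLL is listed as <1% and is left out of the sum

listed_total = sum(TABLE_3_PERCENT.values())
cache_share = TABLE_3_PERCENT["ICACHE"] + TABLE_3_PERCENT["DCACHE"]

TOTAL_CLOCK_PERCENT = 25   # global trees plus local buffers and loads
local_clock = TOTAL_CLOCK_PERCENT - TABLE_3_PERCENT["CLOCK"]

print(listed_total, cache_share, local_clock)  # 100 43 15
```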

Conditional clocking was an integral part of the
design method, so it is difficult to determine the
power saving associated with it. However, the power
associated with driving the conditional clocks is 15%
of the chip power and if the conditions on all the
conditional clock buffers were always true, this power
would quadruple. This does not account for the
additional power savings that have been achieved by
blocking spurious data transitions.
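
The effect of quadrupling the conditional clock power can be
worked out directly. The sketch below keeps all other power
fixed, which, as noted above, understates the benefit because
it ignores the spurious data transitions that conditional
clocking also blocks. The variable names are illustrative.

```python
CONDITIONAL_CLOCK_SHARE = 0.15   # share of chip power today
ALWAYS_ON_FACTOR = 4             # "this power would quadruple"

always_on_clock = CONDITIONAL_CLOCK_SHARE * ALWAYS_ON_FACTOR
relative_chip_power = 1.0 - CONDITIONAL_CLOCK_SHARE + always_on_clock

print(f"chip power without conditions: {relative_chip_power:.2f}x")  # 1.45x
```

On this estimate, removing the conditions would raise total
chip power by roughly 45%.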

CAD Tools 

The CAD tools used on this chip were largely the same
as those used on our Alpha designs. This is not surprising
since the performance target of the chip
roughly parallels that of the Alpha family as noted
in the section Circuit Implementation. The most significant
departure was in the area of static timing
verification and race analysis where the adoption of
edge-triggered latching required significant modifications
to the tools used in the Alpha designs.

Project Organization 

One of the challenging aspects of this project was
geographical. The detailed design was performed at
four sites across a nine-hour time zone range. The initial
feasibility work and architectural definition was
done at Digital Semiconductor's design center in
Austin with on-site participation by personnel from
Advanced RISC Machines Limited (ARM). The
implementation was more widely distributed with the
caches, MMUs, write buffer, and bus interface unit at
Digital Semiconductor's design center in Palo Alto,
the instruction unit, execution unit, and clocks in
Austin, the pad driver and ESD protection circuits at
Digital Semiconductor's main facility in Hudson,
MA, and the PLL at the CSEM design center in
Neuchatel, Switzerland. In addition, we consulted
with Hudson for CAD and process issues, with ARM
in Cambridge, England, for all manner of architectural
issues and implementation tradeoffs associated
with ARM designs and with T. Lee from Stanford on
the PLL. The implementation phase of the project
took less than nine months with about 20 design
engineers.

Conclusion 

The microprocessor described uses traditional high
performance custom circuit design, an intentionally
simple architectural design, and advanced CMOS
process technology to produce a 160 MHz microprocessor
which dissipates less than 450 mW. The
internal supplies can vary from 1.5 to 2.2 V while the
pin interface runs at 3.3 V. The chip implements the
ARM V4 instruction set and delivers 185 Dhrystone
2.1 MIPS at 160 MHz. The chip contains 2.5 million
transistors and is fabricated in a 0.35-µm three-metal
CMOS process. It measures 7.8 mm × 6.4 mm and is
packaged in a 144-pin plastic thin quad flat pack
(TQFP) package.